Abstract

In response to a 1997 problem of M. Vidyasagar, we state a criterion for PAC learnability of a concept
class $\mathscr{C}$ under the family of all non-atomic (diffuse) measures on the domain $\Omega$. The uniform
Glivenko-Cantelli property with respect to non-atomic measures is no longer a necessary condition, and consistent
learnability cannot in general be expected. Our criterion is stated in terms of a combinatorial parameter
$\mathrm{VC}(\mathscr{C} \bmod \omega_1)$, which we call the VC dimension of $\mathscr{C}$ modulo countable sets.
The new parameter is obtained by "thickening up" single points in the definition of VC dimension to uncountable
"clusters". Equivalently, $\mathrm{VC}(\mathscr{C} \bmod \omega_1) \le d$ if and only if every countable subclass
of $\mathscr{C}$ has VC dimension $\le d$ outside a countable subset of $\Omega$. The new parameter can also be
expressed as the classical VC dimension of $\mathscr{C}$ calculated on a suitable subset of a compactification of
$\Omega$. We do not make any measurability assumptions on $\mathscr{C}$, assuming instead the validity of Martin's
Axiom (MA). Similar results are obtained for function learning in terms of fat-shattering dimension modulo
countable sets, but, just like in the classical distribution-free case, the finiteness of this parameter is
sufficient but not necessary for PAC learnability under non-atomic measures.
Martin's Axiom, VC dimension modulo countable sets, fat shattering dimension modulo countable sets

1. Introduction

A fundamental result of statistical learning theory says that under some mild measurability assumptions
on a concept class $\mathscr{C}$ the following three conditions are equivalent: (1) $\mathscr{C}$ is
distribution-free PAC learnable over the family $P(\Omega)$ of all probability measures on the domain $\Omega$,
(2) $\mathscr{C}$ is a uniform Glivenko-Cantelli class with respect to $P(\Omega)$, and (3) the
Vapnik-Chervonenkis dimension of $\mathscr{C}$ is finite [17], [18], [4]. In this paper we are
interested in the problem, discussed by Vidyasagar in both editions of his book [19, 20] as Problem 12.8, of
giving a similar combinatorial description of concept classes $\mathscr{C}$ which are PAC learnable under the
family $P_{na}(\Omega)$ of all non-atomic probability measures on $\Omega$. (A measure $\mu$ is non-atomic, or
diffuse, if every set $A$ of strictly positive measure contains a subset $B$ with $0 < \mu(B) < \mu(A)$.)

The condition $\mathrm{VC}(\mathscr{C}) < \infty$, while of course sufficient for $\mathscr{C}$ to be learnable
under $P_{na}(\Omega)$, is not necessary. Let a concept class $\mathscr{C}$ consist of all finite and all
cofinite subsets of a standard Borel space $\Omega$. Then $\mathrm{VC}(\mathscr{C}) = \infty$, and moreover
$\mathscr{C}$ is clearly not a uniform Glivenko-Cantelli class with respect to non-atomic measures. At the same
time, $\mathscr{C}$ is PAC learnable under non-atomic measures: any learning rule $\mathcal{L}$ consistent
with the subclass $\{\emptyset, \Omega\}$ will learn $\mathscr{C}$. Notice that $\mathscr{C}$ is not consistently
learnable under non-atomic measures: there are consistent learning rules mapping every training sample to a
finite set, and they will not learn any cofinite subset of $\Omega$.
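The behaviour described in this example can be imitated in a small simulation. The sketch below is an illustration only, under assumptions that are not in the text: the domain is taken to be $[0,1)$ with the uniform measure, a concept is encoded as a hypothetical pair `(kind, exceptions)`, and all function names are invented for the purpose.

```python
import random

# Toy encoding (an assumption of this sketch): a concept is a pair
# (kind, exceptions), where kind is "finite" or "cofinite" and exceptions
# is the finite set itself or, respectively, its finite complement.

def member(concept, x):
    """Membership of the point x in a finite or cofinite concept."""
    kind, exceptions = concept
    return (x in exceptions) != (kind == "cofinite")

def error(concept, hypothesis):
    """Uniform measure of the symmetric difference: 0 if both sets are
    finite or both are cofinite, 1 otherwise (finite sets are null)."""
    return 0.0 if concept[0] == hypothesis[0] else 1.0

def rule_empty_or_full(sample, labels):
    """A rule with values in {empty set, whole domain}."""
    if labels and all(labels):
        return ("cofinite", frozenset())  # the whole domain
    return ("finite", frozenset())        # the empty set

def rule_finite(sample, labels):
    """A consistent rule that always answers with a finite set."""
    return ("finite", frozenset(x for x, y in zip(sample, labels) if y))

def run(concept, rule, n=50, seed=0):
    """Draw n uniform points, label them by the concept, return the error."""
    rng = random.Random(seed)
    sample = [rng.random() for _ in range(n)]
    labels = [member(concept, x) for x in sample]
    return error(concept, rule(sample, labels))
```

On a sample that misses the finitely many exceptional points, the first rule recovers a cofinite concept up to a null set, while the second rule, although consistent, returns a finite set and so has error one.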
The most salient feature of this example is that PAC learnability of a concept class $\mathscr{C}$ under
non-atomic measures is not affected by adding to $\mathscr{C}$ the symmetric differences $C \triangle N$ for
each $C \in \mathscr{C}$ and every countable set $N$.

A version of VC dimension oblivious to this kind of set-theoretic "noise" is obtained from the classical
definition by "thickening up" individual points and replacing them with uncountable clusters (Figure 1).



Figure 1: A family $A_1, A_2, \dots, A_n$ of uncountable sets shattered by $\mathscr{C}$.



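Define the VC dimension of a concept class $\mathscr{C}$ modulo countable sets as the supremum of natural $n$ for
which there exists a family of $n$ uncountable sets, $A_1, A_2, \dots, A_n \subseteq \Omega$, shattered by
$\mathscr{C}$ in the sense that for each $J \subseteq \{1, 2, \dots, n\}$, there is $C \in \mathscr{C}$ which
contains all sets $A_i$, $i \in J$, and is disjoint from all sets $A_j$, $j \notin J$. Denote this parameter by
$\mathrm{VC}(\mathscr{C} \bmod \omega_1)$. Clearly, for every concept class $\mathscr{C}$,

$$\mathrm{VC}(\mathscr{C} \bmod \omega_1) \le \mathrm{VC}(\mathscr{C}).$$

In our example above, one has $\mathrm{VC}(\mathscr{C} \bmod \omega_1) = 1$, even as
$\mathrm{VC}(\mathscr{C}) = \infty$.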
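The combinatorial content of this definition can be checked by brute force on finite data. The following is a minimal sketch under stated assumptions: concepts and clusters are finite `frozenset`s, so "uncountable" has no meaning here and the clusters merely play the role of thickened points; the names `shatters` and `vc_dimension` are hypothetical.

```python
from itertools import combinations

def shatters(concepts, clusters):
    """True if every sub-family J of the clusters is picked out by some
    concept: it contains each cluster in J and is disjoint from the rest."""
    n = len(clusters)
    for mask in range(1 << n):
        realised = any(
            all((a <= c) if (mask >> i) & 1 else not (a & c)
                for i, a in enumerate(clusters))
            for c in concepts
        )
        if not realised:
            return False
    return True

def vc_dimension(concepts, domain):
    """Classical VC dimension: the clusters are single points."""
    points = sorted(domain)
    best = 0
    for k in range(1, len(points) + 1):
        if any(shatters(concepts, [frozenset([p]) for p in combo])
               for combo in combinations(points, k)):
            best = k
        else:
            break  # no shattered k-set implies no shattered (k+1)-set
    return best
```

With singleton clusters the first function reduces to ordinary shattering, which is why the thickened parameter never exceeds the classical one.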

Our main theorem for PAC concept learning under non-atomic measures requires an additional set-theoretic
hypothesis, Martin's Axiom (MA) [8, 9, 11]. This is one of the most often used and best studied
additional set-theoretic assumptions beyond the standard Zermelo-Fraenkel set theory with the Axiom of
Choice (ZFC). Here is one of the equivalent forms. Let $B$ be a Boolean algebra satisfying the countable
chain condition (that is, every family of pairwise disjoint elements of $B$ is countable). Then for every family
$\mathcal{X}$ of cardinality $< 2^{\aleph_0}$ of subsets of $B$ there is a maximal ideal $\xi$ (element of the
Stone space of $B$) with the property: each $X \in \mathcal{X}$ disjoint from $\xi$ admits an upper bound
$x \notin \xi$.

The above conclusion holds unconditionally if $\mathcal{X}$ is countable (due to the Baire Category Theorem), and
thus Martin's Axiom follows from the Continuum Hypothesis (CH). At the same time, MA is compatible
with the negation of CH, and in fact it is precisely the combination MA + $\neg$CH that is really interesting. As
a consequence of Martin's Axiom, the usual sigma-additivity of a measure can be strengthened as follows:
the union of fewer than $2^{\aleph_0}$ Lebesgue measurable sets is Lebesgue measurable. Essentially, this is the
only property we need in the proof of the following result.

Theorem 1.1. Let $(\Omega, \mathscr{A})$ be a standard Borel space, and let $\mathscr{C} \subseteq \mathscr{A}$
be a concept class. Under Martin's Axiom, the following are equivalent.

1. $\mathscr{C}$ is PAC learnable under the family of all non-atomic measures.

2. $\mathrm{VC}(\mathscr{C} \bmod \omega_1) = d < \infty$.

3. Every countable subclass $\mathscr{C}' \subseteq \mathscr{C}$ has finite VC dimension on the complement to
some countable subset of $\Omega$ (which depends on $\mathscr{C}'$).

4. There is $d$ such that for every countable $\mathscr{C}' \subseteq \mathscr{C}$ one has
$\mathrm{VC}(\mathscr{C}') \le d$ on the complement to some countable subset of $\Omega$ (depending on
$\mathscr{C}'$).

5. Every countable subclass $\mathscr{C}' \subseteq \mathscr{C}$ is a uniform Glivenko-Cantelli class with
respect to the family of non-atomic measures.

6. Every countable subclass $\mathscr{C}' \subseteq \mathscr{C}$ is a uniform Glivenko-Cantelli class with
respect to the family of non-atomic measures, with sample complexity $s(\varepsilon, \delta)$ which only depends
on $\mathscr{C}$ and not on $\mathscr{C}'$.

If $\mathscr{C}$ is universally separable, the above are also equivalent to:

7. The VC dimension of $\mathscr{C}$ is finite outside of a countable subset of $\Omega$.

8. $\mathscr{C}$ is a uniform Glivenko-Cantelli class with respect to the family of non-atomic probability
measures.

9. $\mathscr{C}$ is consistently PAC learnable under the family of all non-atomic measures.

Notice that for universally separable classes, (H}-(9) are pairwise equivalent without additional set- 
theoretic assumptions. (A class $\mathscr{C}$ is universally separable if it contains a countable subclass
$\mathscr{C}'$ which is universally dense: for each $C \in \mathscr{C}$ there is a sequence $(C_n)$,
$C_n \in \mathscr{C}'$, such that the indicator functions $I_{C_n}$ converge to $I_C$ pointwise.) The concept
class in the above example (which is even image admissible Souslin, but not universally separable) shows that in
general (7), (8) and (9) are not equivalent to the remaining conditions.

The core of Theorem 1.1, and the main technical novelty of our paper, is the proof of the implication whose
conclusion is learnability, condition (1). It is based on a special choice of a consistent learning rule
$\mathcal{L}$ having the property that for every concept $C \in \mathscr{C}$, the image of all learning samples
of the form $(\sigma, C \cap \sigma)$ under $\mathcal{L}$ forms a uniform Glivenko-Cantelli class. It is for
establishing this property of $\mathcal{L}$ that we need Martin's Axiom.

Most of the remaining implications are relatively straightforward adaptations of the standard techniques
of statistical learning. Nevertheless, one of them requires a certain technical dexterity, and we study this
implication in the setting of Boolean algebras.

An analog of Theorem 1.1 also holds for PAC learning of function classes. In this case, we employ
a version of fat shattering dimension [1], which we call fat shattering dimension modulo countable sets and
denote $\mathrm{fat}_\varepsilon(\mathscr{F} \bmod \omega_1)$. However, just like in the classical case,
finiteness of this combinatorial parameter at every scale $\varepsilon > 0$, while sufficient for PAC
learnability of a function class $\mathscr{F}$ under non-atomic measures, is not necessary. It is easy to
construct a function class $\mathscr{F}$ with $\mathrm{fat}_\varepsilon(\mathscr{F} \bmod \omega_1) = \infty$
which is distribution-free probably exactly learnable (Example 7.3).

Recall that a function $f \colon X \to Y$ between two measurable spaces (sets equipped with sigma-algebras of
subsets) is universally measurable if for every measurable subset $A \subseteq Y$ and every probability measure
$\mu$ on $X$ the set $f^{-1}(A)$ is $\mu$-measurable. For instance, Borel functions are universally measurable.

Theorem 1.2. Let $\Omega$ be a standard Borel space, and let $\mathscr{F}$ be a class of universally measurable
functions on $\Omega$ with values in $[0, 1]$. Consider the following conditions.

1. $\mathscr{F}$ is PAC learnable under the family of all non-atomic measures.

2. For every $\varepsilon > 0$, $\mathrm{fat}_\varepsilon(\mathscr{F} \bmod \omega_1) = d(\varepsilon) < \infty$.

3. For each $\varepsilon > 0$, every countable subclass $\mathscr{F}' \subseteq \mathscr{F}$ has finite
$\varepsilon$-fat shattering dimension on the complement to some countable subset of $\Omega$ (which depends on
$\mathscr{F}'$).

4. There is a function $d(\varepsilon)$ such that for every countable $\mathscr{F}' \subseteq \mathscr{F}$ and
all $\varepsilon > 0$ one has $\mathrm{fat}_\varepsilon(\mathscr{F}') \le d(\varepsilon)$ on the complement to
some countable subset of $\Omega$ (depending on $\mathscr{F}'$).

5. Every countable subclass $\mathscr{F}' \subseteq \mathscr{F}$ is a uniform Glivenko-Cantelli class with
respect to the family of non-atomic measures.

6. Every countable subclass $\mathscr{F}' \subseteq \mathscr{F}$ is a uniform Glivenko-Cantelli class with
respect to the family of non-atomic measures, with sample complexity $s(\varepsilon, \delta)$ which only depends
on $\mathscr{F}$ and not on $\mathscr{F}'$.

The conditions (2)-(6) are pairwise equivalent, and under Martin's Axiom each of them implies (1). If
$\mathscr{F}$ is universally separable, these conditions are also equivalent to:

7. For each $\varepsilon > 0$, the $\varepsilon$-fat shattering dimension of $\mathscr{F}$ is finite outside of
a countable subset of $\Omega$.

8. $\mathscr{F}$ is a uniform Glivenko-Cantelli class with respect to the family of non-atomic probability
measures,

and each of them implies

9. $\mathscr{F}$ is consistently PAC learnable under the family of all non-atomic measures.

We begin the paper by reviewing a general formal setting for PAC learnability, after which we proceed
to analysis of a well-known example of a concept class of VC dimension 1 which is not a uniform
Glivenko-Cantelli class and is not consistently PAC learnable (Durst and Dudley; Blumer, Ehrenfeucht, Haussler,
and Warmuth). The example was originally constructed under the Continuum Hypothesis, though in fact Martin's
Axiom suffices. We observe that the class $\mathscr{C}$ in the example is still PAC learnable, and this
observation provides a clue to our approach to constructing learning rules.

This analysis is followed by a series of general results about PAC learnability of a function class
$\mathscr{F}$ under non-atomic measures, assuming Martin's Axiom and without making any assumptions on
measurability of $\mathscr{F}$ except the measurability of individual members $f$ of the class.

In the two sections to follow, we discuss Boolean algebras, which appear to provide a useful framework for
studying concept learning under intermediate families of measures, and commutative C*-algebras and their
spaces of maximal ideals, which provide a similar convenient framework for function classes. In particular, we
will show that for a concept class $\mathscr{C}$ our version of the VC dimension modulo countable sets,
$\mathrm{VC}(\mathscr{C} \bmod \omega_1)$, is just the usual VC dimension of the family of closures,
$\mathrm{cl}(C)$, of all $C \in \mathscr{C}$, taken in a suitable compactification $b\Omega$ of $\Omega$ and
computed over a certain subdomain of $b\Omega$, as illustrated in Figure 2.



A similar result holds for the fat shattering dimension. 

At the next stage we establish the corresponding parts of Theorems 1.1 and 1.2 for universally separable
classes, at which point we have all the machinery needed to accomplish the general case.

A conference version of this paper [13] treated the case of concept classes, but we believe that the
presentation of our approach has now improved considerably.

2. The setting 

We need to fix a precise setting, which is mostly standard [19], [20], see also [1], [12]. The domain
(instance space) $\Omega = (\Omega, \mathscr{A})$ is a measurable space, that is, a set $\Omega$ equipped with a
sigma-algebra of subsets $\mathscr{A}$. Typically, $\Omega$ is assumed to be a standard Borel space, that is, a
complete separable metric space equipped with the sigma-algebra of Borel subsets. We will clarify the assumption
whenever necessary.

In the learning model, a set $\mathcal{P}$ of probability measures on $\Omega$ is fixed. Usually either
$\mathcal{P} = P(\Omega)$ is the set of all probability measures (distribution-free learning), or
$\mathcal{P} = \{\mu\}$ is a single measure (learning under fixed distribution). In our article, the case of
interest is the family $\mathcal{P} = P_{na}(\Omega)$ of all non-atomic measures.

We will not distinguish between a measure $\mu$ and its Lebesgue completion, that is, an extension of $\mu$
over the larger sigma-algebra of Lebesgue measurable subsets of $\Omega$. Consequently, we will sometimes use
the term measurability meaning Lebesgue measurability. No confusion can arise here.

A function class, $\mathscr{F}$, is a family of functions from $\Omega$ to the unit interval $[0, 1]$ which are
measurable with regard to every $\mu \in \mathcal{P}$. For instance, elements of $\mathscr{F}$ can be universally
measurable, or most often Borel. A concept class, $\mathscr{C}$, is a function class with values in $\{0, 1\}$
or, equivalently, a family of measurable subsets of $\Omega$.

Figure 2: $\mathrm{VC}(\mathscr{C} \bmod \omega_1)$ via the usual VC dimension of $\mathscr{C}$. (Labels in the
figure: $\mathrm{cl}(C)$; a subdomain of $b\Omega$.)

Every probability measure $\mu$ on $\Omega$ determines an $L^1$ distance between functions:

$$\|f - g\|_{L^1(\mu)} = \int_\Omega |f(x) - g(x)| \, d\mu(x).$$

For concept classes, this reduces to the following metric:

$$d_\mu(A, B) = \mu(A \triangle B).$$

Often it is convenient to approximate the functions from $\mathscr{F}$ with elements of the hypothesis space,
$\mathscr{H}$, which is, technically, a family of functions whose closure in each space $L^1(\mu)$,
$\mu \in \mathcal{P}$, contains $\mathscr{F}$. However, in our article we make no distinction between
$\mathscr{H}$ and $\mathscr{F}$.

A learning sample is a pair $(\sigma, \tau)$, where $\sigma$ is a finite subset of $\Omega$ and $\tau$ is a
function from $\sigma$ to $[0, 1]$. It is convenient to assume that elements $x_1, x_2, \dots, x_n \in \sigma$
are ordered, and thus the set of all samples $(\sigma, \tau)$ with $|\sigma| = n$ can be identified with
$(\Omega \times [0, 1])^n$. In the case of concept classes, a learning sample is simply a pair $(\sigma, \tau)$
of finite subsets of $\Omega$, where $\tau \subseteq \sigma$ is thought of as the set of points where the
labelling takes the value 1. The set of all samples of size $n$ in this case is $(\Omega \times \{0, 1\})^n$.

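A learning rule (for $\mathscr{F}$) is a mapping

$$\mathcal{L} \colon \bigcup_{n=1}^{\infty} \Omega^n \times [0, 1]^n \to \mathscr{F}$$

which satisfies the following measurability condition: for every $f \in \mathscr{F}$ and $\mu \in \mathcal{P}$,
the function

$$\Omega^n \ni \sigma \mapsto \|\mathcal{L}(\sigma, f \restriction \sigma) - f\|_{L^1(\mu)} \in \mathbb{R} \tag{2.1}$$

is measurable.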

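A learning rule $\mathcal{L}$ is consistent (with a function class $\mathscr{F}$) if for every
$f \in \mathscr{F}$ and each $\sigma \in \Omega^n$ one has

$$\mathcal{L}(\sigma, f \restriction \sigma) \restriction \sigma = f \restriction \sigma.$$

In the case of a concept class $\mathscr{C}$, the consistency condition becomes this: for every
$C \in \mathscr{C}$ and each $\sigma \in \Omega^n$ one has

$$\mathcal{L}(\sigma, C \cap \sigma) \cap \sigma = C \cap \sigma.$$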
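A learning rule $\mathcal{L}$ is probably approximately correct (PAC) under $\mathcal{P}$ if for every
$\varepsilon > 0$

$$\sup_{\mu \in \mathcal{P}} \sup_{f \in \mathscr{F}} \mu^{\otimes n} \left\{ \sigma \in \Omega^n : \|\mathcal{L}(\sigma, f \restriction \sigma) - f\|_{L^1(\mu)} > \varepsilon \right\} \to 0 \text{ as } n \to \infty. \tag{2.2}$$

Here $\mu^{\otimes n}$ denotes the (Lebesgue extension of the) product measure on $\Omega^n$. Now the origin of
the measurability condition (2.1) on the mapping $\mathcal{L}$ is clear: it is implicit in (2.2).

Equivalently, there is a function $s(\varepsilon, \delta)$ (sample complexity of $\mathcal{L}$) such that for
each $f \in \mathscr{F}$ and every $\mu \in \mathcal{P}$ an i.i.d. sample $\sigma$ with at least
$s(\varepsilon, \delta)$ points has the property
$\|\mathcal{L}(\sigma, f \restriction \sigma) - f\|_{L^1(\mu)} \le \varepsilon$ with confidence
$\ge 1 - \delta$.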

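In particular, for a concept class $\mathscr{C}$ it is convenient to rewrite the definition of a PAC learning
rule thus: for each $\varepsilon > 0$,

$$\sup_{\mu \in \mathcal{P}} \sup_{C \in \mathscr{C}} \mu^{\otimes n} \left\{ \sigma \in \Omega^n : \mu\left(\mathcal{L}(\sigma, C \cap \sigma) \triangle C\right) > \varepsilon \right\} \to 0 \text{ as } n \to \infty. \tag{2.3}$$

In terms of the sample complexity function $s(\varepsilon, \delta)$, a learning rule $\mathcal{L}$ is PAC if for
each $C \in \mathscr{C}$ and every $\mu \in \mathcal{P}$ an i.i.d. sample $\sigma$ with at least
$s(\varepsilon, \delta)$ points has the property
$\mu(C \triangle \mathcal{L}(\sigma, C \cap \sigma)) \le \varepsilon$ with confidence $\ge 1 - \delta$.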

A function class $\mathscr{F}$ is PAC learnable under $\mathcal{P}$ if there exists a PAC learning rule for
$\mathscr{F}$ under $\mathcal{P}$. A class $\mathscr{F}$ is consistently learnable (under $\mathcal{P}$) if every
learning rule consistent with $\mathscr{F}$ is PAC under $\mathcal{P}$. If $\mathcal{P} = P(\Omega)$ is the set
of all probability measures, then $\mathscr{F}$ is said to be (distribution-free) PAC learnable. If
$\mathcal{P} = \{\mu\}$ is a single probability measure, one is talking of learning under a single measure (or
distribution). These definitions apply in particular to concept classes as well. Learnability under intermediate
families of measures on $\Omega$ has received considerable attention, cf. Chapter 7 in [20].
Notice that in this paper, we only talk of potential PAC learnability, adopting a purely information-
theoretic viewpoint. As a consequence, our statements about learning rules are existential rather than 
constructive, and building learning rules by transfinite recursion is perfectly acceptable. 

An important concept is that of a uniform Glivenko-Cantelli function class with respect to a family of
measures $\mathcal{P}$, that is, a function class $\mathscr{F}$ such that for each $\varepsilon > 0$

$$\sup_{\mu \in \mathcal{P}} \mu^{\otimes n} \left\{ \sup_{f \in \mathscr{F}} \left| \mathbb{E}_\mu(f) - \mathbb{E}_{\mu_n}(f) \right| > \varepsilon \right\} \to 0 \text{ as } n \to \infty. \tag{2.4}$$

Here $\mu_n$ stands for the empirical (uniform) measure on $n$ points, sampled in an i.i.d. fashion from $\Omega$
according to the distribution $\mu$. The symbol $\mathbb{E}_{\mu_n}(f)$ means the empirical mean of $f$ on
the sample $\sigma$. One also says that $\mathscr{F}$ has the property of uniform convergence of empirical means
(UCEM property) with respect to $\mathcal{P}$ [20].

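In the case of a concept class $\mathscr{C}$, the uniform Glivenko-Cantelli property becomes

$$\sup_{\mu \in \mathcal{P}} \mu^{\otimes n} \left\{ \sup_{C \in \mathscr{C}} \left| \mu(C) - \mu_n(C) \right| > \varepsilon \right\} \to 0 \text{ as } n \to \infty. \tag{2.5}$$

In this case, one says that $\mathscr{C}$ has the property of uniform convergence of empirical measures, which
is also abbreviated to UCEM property (with respect to $\mathcal{P}$).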
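The quantity under the supremum in the uniform Glivenko-Cantelli property can be computed exactly in one simple case. The sketch below makes assumptions that go beyond the text: the concept class consists of the intervals $[0, t]$, $t \in [0, 1]$, the measure is uniform on $[0, 1)$, and the function names are hypothetical. For this class the supremum is the classical Kolmogorov-Smirnov statistic, and its distribution is the same for every non-atomic measure on the interval.

```python
import random

def sup_deviation(sample):
    """sup over t in [0, 1] of |t - (fraction of sample points <= t)|."""
    xs = sorted(sample)
    n = len(xs)
    # The supremum is attained just before or at one of the sample points.
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

def estimate_failure_probability(n, eps, trials=200, seed=0):
    """Monte Carlo estimate of the probability that the deviation exceeds eps."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        sample = [rng.random() for _ in range(n)]
        if sup_deviation(sample) > eps:
            bad += 1
    return bad / trials
```

Running `estimate_failure_probability` for growing `n` at a fixed `eps` gives a numerical picture of the convergence to zero required in the definition, for this one class and one measure only.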

Every uniform Glivenko-Cantelli class (with respect to $\mathcal{P}$) is PAC learnable (under $\mathcal{P}$). In
the distribution-free situation the converse holds under mild additional measurability conditions on the class
(but not always, see a discussion in Section 3 below). For learning under a single measure, it is not so: a PAC
learnable class under a single distribution $\mu$ need not be uniform Glivenko-Cantelli with respect to $\mu$
(cf. Chapter 6 in [20], or else [3], Example 2.10, where a countable counter-example is given). Not every PAC
learnable class under non-atomic measures is uniform Glivenko-Cantelli with respect to non-atomic measures
either: the class consisting of all finite and all cofinite subsets of $\Omega$ is a counter-example.

We say, following Pollard, that a function class $\mathscr{F}$ is universally separable if it contains a
countable subfamily $\mathscr{F}'$ which is universally dense in $\mathscr{F}$: every function
$f \in \mathscr{F}$ is a pointwise limit of a sequence of elements of $\mathscr{F}'$. By the Lebesgue Dominated
Convergence Theorem, for every probability measure $\mu$ on $\Omega$ the set $\mathscr{F}'$ is everywhere dense
in $\mathscr{F}$ in the $L^1(\mu)$-distance. In particular, a concept class $\mathscr{C}$ is universally
separable if it contains a countable subfamily $\mathscr{C}'$ with the property that for every
$C \in \mathscr{C}$ there exists a sequence $(C_n)_{n=1}^{\infty}$ of sets from $\mathscr{C}'$ such that for
every $x \in \Omega$ there is $N$ with the property that, for all $n \ge N$, $x \in C_n$ if $x \in C$, and
$x \notin C_n$ if $x \notin C$.

Probably the main source of uniform Glivenko-Cantelli classes is the finiteness of VC dimension. Assume
that $\mathscr{C}$ satisfies a suitable measurability condition, for instance, $\mathscr{C}$ is image admissible
Souslin, or else universally separable. (In particular, a countable class $\mathscr{C}$ satisfies either
condition.) If $\mathrm{VC}(\mathscr{C}) = d < \infty$, then $\mathscr{C}$ is uniform Glivenko-Cantelli, with a
sample complexity bound that does not depend on $\mathscr{C}$, but only on $\varepsilon$, $\delta$, and $d$. The
following is a typical (and far from optimal) such estimate, which can be deduced, for instance, along the
lines of [12]:



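$$s(\varepsilon, \delta, d) \le \frac{c_1}{\varepsilon^2} \left( d \log\left( \frac{c_2}{\varepsilon} \log \frac{c_3}{\varepsilon} \right) + \log \frac{c_4}{\delta} \right), \tag{2.6}$$

with absolute numerical constants $c_1, \dots, c_4$.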

For our purposes, we will fix any such bound and refer to it as a "standard" sample complexity estimate for 
s(e,S, d). 

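Let us recall a more general concept of fat shattering dimension [1], which is relevant for function classes.
Let $\varepsilon > 0$. A finite subset $A$ of $\Omega$ is $\varepsilon$-fat shattered by a function class
$\mathscr{F}$ with witness function $h \colon A \to [0, 1]$ if for every $B \subseteq A$ there is a function
$f_B \in \mathscr{F}$ such that

$$\begin{cases} f_B(a) > h(a) + \varepsilon & \text{for } a \in B, \\ f_B(a) < h(a) - \varepsilon & \text{for } a \in A \setminus B. \end{cases} \tag{2.7}$$

The $\varepsilon$-fat shattering dimension of $\mathscr{F}$ (over the domain $\Omega$) is defined as

$$\mathrm{fat}_\varepsilon(\mathscr{F}) = \sup \left\{ |A| : A \subseteq \Omega,\ A \text{ is } \varepsilon\text{-fat shattered by } \mathscr{F} \right\}.$$

In particular, if $\mathscr{C}$ is a concept class, then for any $\varepsilon < 1/2$ the $\varepsilon$-fat
shattering dimension of $\mathscr{C}$ is the VC dimension of $\mathscr{C}$. If we want to stress that the
combinatorial dimension is calculated over a particular domain $\Omega$, we will use the notation
$\mathrm{fat}_\varepsilon(\mathscr{F} \restriction \Omega)$ and $\mathrm{VC}(\mathscr{C} \restriction \Omega)$.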
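For finite data, the shattering condition (2.7) can be verified directly. The following is a minimal sketch, assuming that functions are given as Python callables, that the witness is a dictionary from points to values, and that the inequalities in (2.7) are strict; the name `fat_shatters` is hypothetical.

```python
from itertools import product

def fat_shatters(functions, points, witness, eps):
    """Check condition (2.7): for every subset B of the points (encoded as a
    True/False pattern) some function is above witness + eps on B and
    below witness - eps off B."""
    for pattern in product((False, True), repeat=len(points)):
        realised = any(
            all(
                (f(a) > witness[a] + eps) if inside else (f(a) < witness[a] - eps)
                for a, inside in zip(points, pattern)
            )
            for f in functions
        )
        if not realised:
            return False
    return True
```

Applied to indicator functions with the constant witness 1/2 and any `eps` below 1/2, the check coincides with ordinary shattering, in line with the remark on concept classes above.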

In the definition of $\varepsilon$-fat shattering dimension, one can assume without loss of generality the values
of $\varepsilon$ and of a witness function to be rational. More precisely, the following holds.

Lemma 2.1. Suppose a finite set $A$ is $\varepsilon$-fat shattered by a function class $\mathscr{F}$. Then there
is a rational value $\varepsilon' > \varepsilon$ such that $A$ is $\varepsilon'$-fat shattered by $\mathscr{F}$
with a rational-valued witness function $h' \colon A \to \mathbb{Q}$.

Proof. Let $h$ be a witness of $\varepsilon$-fat shattering for $A$. For each $B \subseteq A$ choose a function
$f_B$ satisfying Condition (2.7). For every $a \in A$ define

$$S_a = \min_{B \,:\, a \in B} f_B(a), \qquad s_a = \max_{B \,:\, a \in A \setminus B} f_B(a).$$

One has: $s_a < h(a) - \varepsilon < h(a) + \varepsilon < S_a$, and so $S_a - s_a > 2\varepsilon$. One can
therefore select rational values $\varepsilon'_a > \varepsilon$ and $h'(a)$ such that
$s_a + \varepsilon'_a < h'(a) < S_a - \varepsilon'_a$. This way, we obtain a desired witness function $h'$, and
the proof is now finished by setting $\varepsilon' = \min_{a \in A} \varepsilon'_a$. □

Every function class $\mathscr{F}$ whose $\varepsilon$-fat shattering dimension is finite at every scale
$\varepsilon > 0$ is uniform Glivenko-Cantelli. Here is an asymptotic estimate of the sample size taken from [1]
(Theorem 3.6):

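$$s(\varepsilon, \delta, d) \le C \, \frac{1}{\varepsilon^2} \left( d(\varepsilon/24) \log^2 \frac{d(\varepsilon/24)}{\varepsilon} + \log \frac{1}{\delta} \right), \tag{2.8}$$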

where $d \colon \mathbb{R}_+ \to \mathbb{N}$ is the fat-shattering dimension of $\mathscr{F}$ understood as a
function of $\varepsilon$, $d(\varepsilon) = \mathrm{fat}_\varepsilon(\mathscr{F})$. In the formula, $C$ denotes
a universal constant whose value can be extracted from the proofs in [1], but, given the presence of such a loose
scale as $\varepsilon/24$, does not really matter. Tighter sample size estimates can be found in the literature.
Again, we will refer to Condition (2.8) as the "standard" complexity estimate corresponding to the fat shattering
dimension function $d$.

Finally, recall that a subset $N \subseteq \Omega$ is universal null if for every non-atomic probability measure
$\mu$ on $(\Omega, \mathscr{A})$ one has $\mu(N') = 0$ for some Borel set $N'$ containing $N$. Universal null
Borel sets are just countable sets.



3. Revisiting an example of Durst and Dudley 

In order to explain our approach to constructing a learning rule that is PAC under non-atomic distri- 
butions, we need to examine the traditional way of proving distribution-free PAC learnability. A usual 
approach consists of two stages. 

1. A function (or concept) class $\mathscr{F}$ is uniform Glivenko-Cantelli as long as a suitable combinatorial
parameter of $\mathscr{F}$ (VC dimension, fat-shattering dimension etc.) is finite.

2. A uniform Glivenko-Cantelli class $\mathscr{F}$ is PAC learnable. Moreover, such a class is consistently PAC
learnable: every consistent learning rule $\mathcal{L}$ for $\mathscr{F}$ is probably approximately correct.

The proof of every statement of the former type depends in an essential way on the Fubini theorem, and
so some measurability restrictions on the class $\mathscr{F}$ are necessary. Without them, the conclusion is not
true in general. Here is a classical example of a concept class having finite VC dimension which is not uniform
Glivenko-Cantelli.

Example 3.1 (Durst and Dudley, Proposition 2.2; cf. also [21], p. 314). Let $\Omega$ be
an uncountable standard Borel space, that is, up to an isomorphism, a Borel space associated to the unit
interval $[0, 1]$. The cardinality of $\Omega$ is continuum. Choose a minimal well-ordering $\prec$ on $\Omega$,
and let $\mathscr{C}$ consist of all half-open initial segments of the ordered set $(\Omega, \prec)$, that is,
subsets of the form $I_y = \{x \in \Omega : x \prec y\}$, $y \in \Omega$. Clearly, the VC dimension of the class
$\mathscr{C}$ is one.
Fix a non-atomic Borel probability measure $\mu$ on $\Omega$ (e.g., the Lebesgue measure on $[0, 1]$).

Now assume the validity of the Continuum Hypothesis. Under this assumption, every element of $\mathscr{C}$ is a
countable set, therefore Borel measurable of measure zero. At the same time, for every $n$ and each random
$n$-sample $\sigma$, there is a countable initial segment $C \in \mathscr{C}$ containing all elements of
$\sigma$. The empirical measure of $C$ with respect to $\sigma$ is one. Thus, no finite sample guesses the
measure of all elements of $\mathscr{C}$ to within an accuracy $\varepsilon < 1$ with a non-vanishing confidence.

A further modification of this construction gives an example of a concept class of finite VC dimension 
which is not consistently PAC learnable. 

Example 3.2 (Blumer, Ehrenfeucht, Haussler, and Warmuth, p. 953). Again, assume the Continuum
Hypothesis. Add to the concept class $\mathscr{C}$ from Example 3.1 the set $\Omega$ as an element. In other
words, form a concept class $\mathscr{C}'$ consisting of all initial segments of $(\Omega, \prec)$, including
improper ones. One still has $\mathrm{VC}(\mathscr{C}') = 1$. For a finite labelled sample $(\sigma, \tau)$ define

$$\mathcal{L}(\sigma, \tau) = \min\{y : \tau \subseteq I_y\}. \tag{3.1}$$

The learning rule $\mathcal{L}$ is clearly consistent with the class $\mathscr{C}'$, but is not probably
approximately correct, because for the concept $C = \Omega$ the value
$\mathcal{L}(\sigma, \Omega \cap \sigma) = \mathcal{L}(\sigma, \sigma)$ will always be a countable concept $I_y$,
and if $\mu$ is a non-atomic Borel probability measure on $\Omega$, then $\mu(C \triangle I_y) = 1$. The concept
$C = \Omega$ is not learned to accuracy $\varepsilon < 1$ with a non-zero confidence.

Remark 3.3. It is important to note that, again under the Continuum Hypothesis, the class $\mathscr{C}'$ is
nevertheless distribution-free PAC learnable.

Indeed, redefine a well-ordering on $\mathscr{C}' = \{I_x : x \in \Omega\} \cup \{\Omega\}$ by making $\Omega$
the smallest element (instead of the largest one) and keeping the order relation between the other elements the
same. Denote the new order relation by $\prec_1$, and define a learning rule $\mathcal{L}_1$ similarly to
Eq. (3.1), but this time understanding the minimum with respect to the order $\prec_1$:

$$\mathcal{L}_1(\sigma, \tau) = \min\nolimits_{\prec_1} \left\{ C \in \mathscr{C}' : C \cap \sigma = \tau \right\}. \tag{3.2}$$

In essence, $\mathcal{L}_1$ examines all the concepts following a transfinite order on them, and if a labelled
sample is consistent with the class $\mathscr{C}'$ then $\mathcal{L}_1$ returns the first concept consistent
with the sample that it comes across.

To understand what difference it makes with Example 3.2, let $\mu$ be again a non-atomic probability
measure on $\Omega$. If $C = \Omega$, then for every sample $\sigma$ consistently labelled with $C$ the rule
$\mathcal{L}_1$ will return $C$, because this is the smallest consistent concept encountered by the algorithm.
If $C \ne \Omega$, then for $\mu$-almost all samples $\sigma$ (that is, for a set of $\mu$-measure one) the
labelling on $\sigma$ produced by $C$ will be empty, and the concept $\mathcal{L}_1(\sigma, \emptyset)$ returned
by $\mathcal{L}_1$, while possibly different from $C$, will be again a countable concept, meaning that
$\mu(C \triangle \mathcal{L}_1(\sigma, \emptyset)) = 0$.
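The mechanics of the two rules can be traced on a finite toy model. This is an illustration only, under assumptions not made in the text: the domain is $\{0, \dots, N-1\}$ with its usual order, so nothing here is countable versus uncountable and no measure is involved; the function names are hypothetical. The sketch only shows how changing the position of the whole domain in the ordering changes the answer on a fully positive sample.

```python
def initial_segment(y):
    """I_y = {x : x < y} inside the toy domain {0, ..., N-1}."""
    return frozenset(range(y))

def rule_L(sample, positives, N):
    """Analogue of (3.1): the smallest initial segment containing the positives."""
    y = max(positives) + 1 if positives else 0
    return initial_segment(y)

def rule_L1(sample, positives, N):
    """Analogue of (3.2): the whole domain is tried first, then the proper
    initial segments in increasing order; the first consistent one wins."""
    omega = frozenset(range(N))
    candidates = [omega] + [initial_segment(y) for y in range(N)]
    for concept in candidates:
        if concept & sample == positives:
            return concept
    return None  # the labelled sample is not consistent with the class
```

On a sample whose points are all labelled positive, `rule_L` answers with a proper initial segment, whereas `rule_L1` answers with the whole domain; on samples labelled by a proper segment the two rules agree in this toy model.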

To give a formal proof that $\mathcal{L}_1$ is PAC, notice that for every $C \in \mathscr{C}'$ and each
$n \in \mathbb{N}$ the collection of pairwise distinct concepts $\mathcal{L}_1(\sigma, C \cap \sigma)$,
$\sigma \in \Omega^n$, is only countable (under the Continuum Hypothesis), because they are all contained in the
$\prec_1$-initial segment of a minimally ordered set of cardinality continuum, bounded by $C$ itself. As a
consequence, the concept class

$$\mathcal{L}_C = \left\{ \mathcal{L}_1(\sigma, C \cap \sigma) : \sigma \in \Omega^n,\ n \in \mathbb{N} \right\} \subseteq \mathscr{C}' \tag{3.3}$$

is also countable (assuming the Continuum Hypothesis). The VC dimension of the family $\mathcal{L}_C \cup \{C\}$
is $\le 1$, and being countable, it is a uniform Glivenko-Cantelli class with a standard sample complexity as in
Eq. (2.6).
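Consequently, given $\varepsilon, \delta > 0$, and assuming that $n$ is sufficiently large, one has for each
probability measure $\mu$ on $\Omega$, with confidence at least $1 - \delta$ over the samples
$\sigma \in \Omega^n$,

$$\mu\left(C \triangle \mathcal{L}_1(\sigma, C \cap \sigma)\right) < \varepsilon$$

provided $n \ge s(\varepsilon, \delta, 1)$, as required.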
Remark 3.4. Thus, under the Continuum Hypothesis, the example of Dudley and Durst as modified by
Blumer, Ehrenfeucht, Haussler, and Warmuth gives an example of a PAC learnable concept class which is
not uniform Glivenko-Cantelli (even if having finite VC dimension). As will become clear in the next
section, the assumption of the Continuum Hypothesis can be weakened to Martin's Axiom. Still, it would be
interesting to know whether an example with the same combination of properties can be constructed without
additional set-theoretic assumptions.

A basic observation of this section is that in order for a learning rule $\mathcal{L}$ to be PAC, the assumption
on $\mathscr{F}$ being uniform Glivenko-Cantelli can be weakened as follows.

Lemma 3.5. Let $\mathscr{F}$ be a function class and $\mathcal{P}$ a family of probability measures on the
domain $\Omega$. Suppose there exists a function $s(\varepsilon, \delta)$ and a consistent learning rule
$\mathcal{L}$ for $\mathscr{F}$ with the property that for every $f \in \mathscr{F}$, the set
$\mathcal{L}_f \cup \{f\}$ is Glivenko-Cantelli with respect to $\mathcal{P}$ with the sample complexity
$s(\varepsilon, \delta)$, where

$$\mathcal{L}_f = \left\{ \mathcal{L}(\sigma, f \restriction \sigma) : \sigma \in \Omega^n,\ n \in \mathbb{N} \right\}.$$

Then $\mathcal{L}$ is probably approximately correct under $\mathcal{P}$ with sample complexity
$s(\varepsilon, \delta)$. □

Remark 3.6. Of course instead of $\mathcal{L}_f \cup \{f\}$ it is sufficient to make the same assumption on the
class $\mathcal{L}_f$. This will not affect the PAC learnability of $\mathcal{L}$. However, an estimate for the
sample complexity of the union in terms of $s(\varepsilon, \delta)$ will be somewhat awkward, and in view of a
specific way in which the above Lemma is going to be used, the current assumption is technically more convenient.

This simple fact becomes very useful in combination with the technique of well-orderings in the case
where $\mathcal{P}$ consists of non-atomic measures and therefore consistent PAC learnability is not to be
expected. At the same time, this approach requires additional set-theoretic axioms in order to assure
measurability of emerging function classes. Of course the Continuum Hypothesis is a rather strong assumption,
which is particularly unnatural in a probabilistic context (cf. [7]). But it is unnecessary. Martin's Axiom is a
much weaker and natural additional set-theoretic axiom, which works just as well. We explain how the above idea
is formalized in the setting of Martin's Axiom in the next section.

4. Learnability under Martin's Axiom 

Martin's Axiom (MA) [8, 9, 11] in one of its equivalent forms says that no compact Hausdorff topological
space with the countable chain condition is a union of strictly less than continuum nowhere dense subsets.
Thus, it can be seen as a strengthening of the statement of the Baire Category Theorem. In particular,
the Continuum Hypothesis (CH) implies MA. However, MA is compatible with the negation of CH, and
this is where the most interesting applications of MA are to be found. We will be using just one particular
consequence of Martin's Axiom. For the proof of the following result, see [11], Theorem 2.21, or [8], or [9],
pp. 563-565.

Theorem 4.1 (Martin-Solovay). Let $(\Omega, \mu)$ be a standard Lebesgue non-atomic probability space. Under
Martin's Axiom, the Lebesgue measure is $2^{\aleph_0}$-additive, that is, if $\kappa < 2^{\aleph_0}$ and
$A_\alpha$, $\alpha < \kappa$, is a family of pairwise disjoint measurable sets, then
$\bigcup_{\alpha < \kappa} A_\alpha$ is Lebesgue measurable and

$$\mu\left( \bigcup_{\alpha < \kappa} A_\alpha \right) = \sum_{\alpha < \kappa} \mu(A_\alpha).$$

In particular, the union of less than continuum null subsets of $\Omega$ is a null subset. □

Here is a central technical tool used in our proofs. 

Lemma 4.2. Let $\mathscr{F}$ be a function class and $\mathcal{P}$ a family of probability measures on a
standard Borel domain $\Omega$. Consider the following properties.

1. Every countable subclass of $\mathscr{F}$ is uniform Glivenko-Cantelli with respect to $\mathcal{P}$.

2. There is a function $s(\varepsilon, \delta)$ such that every countable subclass of $\mathscr{F}$ is uniform
Glivenko-Cantelli with respect to $\mathcal{P}$ with sample complexity $s(\varepsilon, \delta)$.

3. Every subclass of $\mathscr{F}$ having cardinality $< 2^{\aleph_0}$ is uniform Glivenko-Cantelli with respect
to $\mathcal{P}$.

4. There is a function $s(\varepsilon, \delta)$ such that every subclass of $\mathscr{F}$ having cardinality
$< 2^{\aleph_0}$ is uniform Glivenko-Cantelli with respect to $\mathcal{P}$ with sample complexity
$s(\varepsilon, \delta)$.

Then 

$$(1) \iff (2), \qquad (4) \implies (2), \qquad (4) \implies (3) \implies (1).$$

Under Martin's Axiom, all four conditions are equivalent. 

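Proof. The implications (2) $\Rightarrow$ (1), (3) $\Rightarrow$ (1), (4) $\Rightarrow$ (3) and (4) $\Rightarrow$ (2) are trivially true. To show 
(1) $\Rightarrow$ (2), let $\delta, \varepsilon > 0$ be arbitrary but fixed. For each countable subclass $\mathscr F'$, choose the smallest value of sample 
complexity $s = s(\mathscr F', \varepsilon, \delta) \in \mathbb N$. The integer-valued function $\mathscr F' \mapsto s(\mathscr F', \varepsilon, \delta)$ is monotone under inclusions: 
if $\mathscr F' \subseteq \mathscr F''$, then $s(\mathscr F', \varepsilon, \delta) \le s(\mathscr F'', \varepsilon, \delta)$. If $\mathscr F'_n$, $n = 1, 2, \ldots$, is a countable sequence of countable classes, then the 
union $\bigcup_{n=1}^\infty \mathscr F'_n$ is a countable class, whose sample complexity $s(\bigcup_{n=1}^\infty \mathscr F'_n, \varepsilon, \delta)$ forms an upper bound for all 
$s(\mathscr F'_n, \varepsilon, \delta)$, $n = 1, 2, \ldots$. Thus, the function $\mathscr F' \mapsto s(\mathscr F', \varepsilon, \delta)$ for $\delta, \varepsilon > 0$ fixed is bounded on countable sets 
of inputs. To conclude the proof, it is enough to notice that a real-valued function is bounded if and only if 
its restriction to every countable subset of the domain is bounded. 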

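Now assume Martin's Axiom. It is enough to prove (2) $\Rightarrow$ (4). This is done by a transfinite induction on 
the cardinality $\kappa = |\mathscr F'| < 2^{\aleph_0}$ of a subclass $\mathscr F' \subseteq \mathscr F$. Let us pick the same complexity function $s = s(\varepsilon, \delta)$ as in (2). For $\kappa = \aleph_0$ 
there is nothing to prove. Else, represent $\mathscr F'$ as a union of an increasing transfinite chain of function classes 
$\mathscr F_\alpha$, $\alpha < \kappa$, for each of which the statement of (4) holds. For every $\varepsilon > 0$ and $n \in \mathbb N$, the set 

$$\Bigl\{\sigma \in \Omega^n \colon \sup_{f \in \mathscr F'} |E_{\mu_n(\sigma)}(f) - E_\mu(f)| < \varepsilon\Bigr\} = \bigcap_{\alpha<\kappa} \Bigl\{\sigma \in \Omega^n \colon \sup_{f \in \mathscr F_\alpha} |E_{\mu_n(\sigma)}(f) - E_\mu(f)| < \varepsilon\Bigr\}$$

is measurable as an easy consequence of the Martin-Solovay Theorem 4.1. Given $\delta > 0$ and $n \ge s(\varepsilon, \delta)$, another 
application of the same result leads to conclude that for every $\mu \in \mathcal P$: 

$$\mu^{\otimes n}\Bigl\{\sigma \in \Omega^n \colon \sup_{f \in \mathscr F'} |E_{\mu_n(\sigma)}(f) - E_\mu(f)| < \varepsilon\Bigr\} = \inf_{\alpha<\kappa} \mu^{\otimes n}\Bigl\{\sigma \in \Omega^n \colon \sup_{f \in \mathscr F_\alpha} |E_{\mu_n(\sigma)}(f) - E_\mu(f)| < \varepsilon\Bigr\} \ge 1 - \delta,$$

as required. □ 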

Lemma 4.3. Let $\mathscr F$ be a function class whose countable subclasses are uniform Glivenko-Cantelli with 
respect to a family of probability measures $\mathcal P$. Let $\mathcal L$ be a consistent learning rule for $\mathscr F$ with the property 
that for every $f \in \mathscr F$, the set 

$$\mathcal L^{f,n} = \{\mathcal L(f \restriction \sigma) \colon \sigma \in \Omega^n\} \tag{4.1}$$

has cardinality strictly less than continuum. Under Martin's Axiom, the rule $\mathcal L$ is probably approximately 
correct under $\mathcal P$. The common sample complexity of countable subclasses of $\mathscr F$ becomes the sample complexity 
bound for the learning rule $\mathcal L$. 

Proof. Recall that under Martin's Axiom $2^{\aleph_0}$ is a regular cardinal, and thus admits no countable cofinal subset. Therefore, under 
the assumptions of the Lemma, the cardinality of $\mathcal L^f = \bigcup_{n=1}^\infty \mathcal L^{f,n}$ is still strictly less than continuum. The same 
is true of the class $\mathcal L^f \cup \{f\}$. Applying now Lemma 4.2 and then Lemma 15751 we conclude. □ 

The following result establishes existence of learning rules with the above property. 

Lemma 4.4. Let $\mathscr F$ be an infinite function class on a measurable space $\Omega$. Denote $\kappa = |\mathscr F|$ the cardinality 
of $\mathscr F$. There exists a consistent learning rule $\mathcal L$ for $\mathscr F$ with the property that for every $f \in \mathscr F$ and each $n$, 
the set $\mathcal L^{f,n}$ (cf. Eq. (4.1)) has cardinality $< \kappa$. Under Martin's Axiom the rule $\mathcal L$ satisfies the measurability 
condition (2.1). 

Proof. Choose a minimal well-ordering of elements of $\mathscr F$: 

$$\mathscr F = \{f_\alpha \colon \alpha < \kappa\}.$$

Notice that $\kappa$ never exceeds the cardinality of the continuum $2^{\aleph_0}$ because $\mathscr F$ consists of Borel subsets of a 
standard Borel domain. For this reason, every initial segment of the above ordering has cardinality strictly 
less than $2^{\aleph_0}$. For every $\sigma \in \Omega^n$ and $\tau \in [0, 1]^n$, set the value $\mathcal L(\sigma, \tau)$ of the learning rule equal to $f_\beta$, where 

$$\beta = \min\{\alpha < \kappa \colon f_\alpha \restriction \sigma = \tau\},$$

provided such a $\beta$ exists. Clearly, for each $\alpha < \kappa$ one has 

$$\mathcal L(\sigma, f_\alpha \restriction \sigma) \in \{f_\beta \colon \beta \le \alpha\},$$

which assures that the set in (4.1) has cardinality strictly less than continuum. Besides, the learning rule $\mathcal L$ 
is consistent. 

Fix $f = f_\alpha \in \mathscr F$, $\alpha < \kappa$. For every $\beta \le \alpha$ define $D_\beta = \{\sigma \in \Omega^n \colon f \restriction \sigma = f_\beta \restriction \sigma\}$. The sets $D_\beta$ are 
measurable, and the function 

$$\Omega^n \ni \sigma \mapsto E_\mu |\mathcal L(f \restriction \sigma) - f| \in \mathbb R$$

takes a constant value $\|f - f_\beta\|_{L^1(\mu)}$ on each set $D_\beta \setminus \bigcup_{\gamma<\beta} D_\gamma$, $\beta \le \alpha$. Such sets, as well as all their possible 
unions, are measurable under Martin's Axiom by force of the Martin-Solovay Theorem 4.1, and their union is 
$\Omega^n$. This implies the condition (2.1) for $\mathcal L$. □ 
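
The rule built in this proof can be illustrated on a finite class, where the well-ordering is just a fixed enumeration. The sketch below is only a toy analogue under that assumption: the names `min_index_rule`, `hypotheses` and `reachable` are hypothetical and do not come from the paper. It returns the first hypothesis in the enumeration that agrees with the labelled sample, so for a target with index alpha the output always has index at most alpha.

```python
from itertools import product

def min_index_rule(hypotheses, sample, labels):
    """Return the least index beta such that hypotheses[beta] agrees with
    the given labels on every point of the sample, or None if none does."""
    for beta, f in enumerate(hypotheses):
        if all(f(x) == y for x, y in zip(sample, labels)):
            return beta
    return None

# Toy class (an assumption for illustration): threshold functions on the
# integers 0..5, listed in a fixed order playing the role of the well-ordering.
hypotheses = [lambda x, t=t: 1 if x >= t else 0 for t in range(6)]

def reachable(alpha, n):
    """Indices that the rule can output on n-samples labelled by the
    target hypotheses[alpha]."""
    out = set()
    for sample in product(range(6), repeat=n):
        labels = [hypotheses[alpha](x) for x in sample]
        out.add(min_index_rule(hypotheses, sample, labels))
    return out

if __name__ == "__main__":
    for alpha in range(6):
        print(alpha, sorted(reachable(alpha, 2)))
```

In the finite toy the bound on the set of possible outputs is trivial; in the lemma the same minimality argument is what keeps that set of cardinality less than continuum.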

Lemma 4.3 and Lemma 4.4 lead to the following result. 

Theorem 4.5 (Assuming Martin's Axiom). Let $\mathscr F$ be a function class consisting of Borel measurable 
functions on a standard Borel domain $\Omega$, and let $\mathcal P$ be a family of probability measures on $\Omega$. Suppose 
that every countable subclass of $\mathscr F$ is uniform Glivenko-Cantelli with respect to $\mathcal P$. Then the function class 
$\mathscr F$ is PAC learnable. In addition, there exists a common sample complexity bound for countable subclasses 
of $\mathscr F$, and any such bound gives a sample complexity bound for PAC learnability of $\mathscr F$. □ 

We again recall that a set $A \subseteq \Omega$ is universal null if it is Lebesgue measurable with respect to every 
non-atomic Borel probability measure $\mu$ on $\Omega$ and $\mu(A) = 0$. 

Corollary 4.6 (Assuming Martin's Axiom). Let $\mathscr F$ be a function class consisting of Borel measurable 
functions on a standard Borel space $\Omega$. Suppose for every $\varepsilon > 0$ there is a natural number $d(\varepsilon)$ such that 
every countable subclass $\mathscr F' \subseteq \mathscr F$ has $\varepsilon$-fat shattering dimension $\le d(\varepsilon)$ outside of some universal null set 
(which depends on $\mathscr F'$). Then the function class $\mathscr F$ is PAC learnable under the family $\mathcal P$ of non-atomic 
probability measures, with the standard sample complexity corresponding to the given value of fat shattering 
dimension. 

Proof. Let $\mathscr F' \subseteq \mathscr F$ be a countable subclass. For every $n \in \mathbb N$, choose a universal null set $A_n$ such that the $1/n$-fat 
shattering dimension of $\mathscr F'$ restricted to $\Omega \setminus A_n$ is bounded by $d(1/n)$. Consider $A = \bigcup_{n=1}^\infty A_n$. The function 
class $\mathscr F'$ restricted to $\Omega \setminus A$ is uniform Glivenko-Cantelli, with the usual sample complexity given by $d(\varepsilon)$. 
In particular, $\mathscr F' \restriction (\Omega \setminus A)$ is uniform Glivenko-Cantelli with respect to the family $\mathcal P$ of non-atomic probability 
measures. Since $\mu(A) = 0$ for all $\mu \in \mathcal P$, we conclude that the class $\mathscr F'$ is uniform Glivenko-Cantelli with 
respect to $\mathcal P$ even if viewed on the original domain of definition, $\Omega$. It remains to apply Theorem 4.5. □ 

Corollary 4.7 (Assuming Martin's Axiom). Let $\mathscr C$ be a concept class consisting of Borel measurable 
functions on a standard Borel space $\Omega$. Suppose that for some $d$ every countable subclass $\mathscr C' \subseteq \mathscr C$ has VC 
dimension $\le d$ outside of a universal null set (which depends on $\mathscr C'$). Then the concept class $\mathscr C$ is PAC 
learnable under the family $\mathcal P$ of non-atomic probability measures, with the standard sample complexity 
corresponding to the given value of VC dimension. □ 



5. VC dimension and Boolean algebras 

Recall that a Boolean algebra, $B = (B, \wedge, \vee, \neg, 0, 1)$, consists of a set, $B$, equipped with two associative 
and commutative binary operations, $\wedge$ ("meet") and $\vee$ ("join"), which are distributive over each other and 
satisfy the absorption principles $a \vee (a \wedge b) = a$, $a \wedge (a \vee b) = a$, as well as a unary operation $\neg$ (complement) 
and two elements $0$ and $1$, satisfying $a \vee \neg a = 1$, $a \wedge \neg a = 0$. 

For instance, the family $2^\Omega$ of all subsets of a set $\Omega$, with the union as join, intersection as meet, the 
empty set as $0$ and $\Omega$ as $1$, as well as the set-theoretic complement $\neg A = A^c$, forms a Boolean algebra. In 
fact, every Boolean algebra can be realized as an algebra of subsets of a suitable $\Omega$. Even better, according 
to the Stone representation theorem, a Boolean algebra $B$ is isomorphic to the Boolean algebra formed by all 
open-and-closed subsets of a suitable compact space, $S(B)$, called the Stone space of $B$, where the Boolean 
algebra operations are interpreted set-theoretically as above. 

The space $S(B)$ can be obtained in different ways. For instance, one can think of elements of $S(B)$ as 
Boolean algebra homomorphisms from $B$ to the two-element Boolean algebra $\{0, 1\}$ (the algebra of subsets 
of a singleton). In this way, $S(B)$ is a closed topological subspace of the compact zero-dimensional space 
$\{0, 1\}^B$ with the usual Tychonoff product topology. 

The Stone space of the Boolean algebra $B = 2^\Omega$ is known as the Stone-Čech compactification of $\Omega$, and 
is denoted $\beta\Omega$. The elements of $\beta\Omega$ are ultrafilters on $\Omega$. A collection $\xi$ of non-empty subsets of $\Omega$ is an 
ultrafilter if it is closed under finite intersections and if for every subset $A \subseteq \Omega$ either $A \in \xi$ or $A^c \in \xi$. To 
every point $x \in \Omega$ there corresponds a trivial (principal) ultrafilter, $\bar x$, consisting of all sets $A$ containing 
$x$. However, if $\Omega$ is infinite, the Axiom of Choice assures that there exist non-principal ultrafilters on $\Omega$. 
Recall that a non-empty family $\mathfrak F$ of non-empty subsets of a set $X$ is a filter if it is closed under finite 
intersections and supersets. A consequence of the Axiom of Choice (the Ultrafilter Lemma) states that every filter is contained 
in an ultrafilter. Now starting with a filter having an empty intersection (e.g. the filter of all cofinite subsets 
of the natural numbers), one obtains a non-principal ultrafilter. 

Basic open sets in the space $\beta\Omega$ are of the form $\bar A = \{\xi \in \beta\Omega \colon A \in \xi\}$, where $A \subseteq \Omega$. It is interesting to 
note that each $\bar A$ is at the same time closed, and in fact $\bar A$ is the closure of $A$ in $\beta\Omega$. Moreover, every open 
and closed subset of $\beta\Omega$ is of the form $\bar A$. 

A one-to-one correspondence between ultrafilters on $\Omega$ and Boolean algebra homomorphisms $2^\Omega \to \{0, 1\}$ 
is this: think of an ultrafilter $\xi$ on $\Omega$ as its own indicator function $\chi_\xi$ on $2^\Omega$, sending $A \subseteq \Omega$ to $1$ if and 
only if $A \in \xi$. It is not difficult to verify that $\chi_\xi$ is a Boolean algebra homomorphism, and that every 
homomorphism arises in this way. 
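
On a finite set the definition of an ultrafilter can be checked by brute force, which makes the correspondence with homomorphisms concrete. The following sketch is an illustration only (the helper names `powerset`, `is_ultrafilter`, `all_ultrafilters` and `indicator` are assumptions, not notation from the text): it enumerates every family of subsets of a three-point set, keeps those satisfying the definition above, and exposes the indicator function of an ultrafilter. On a finite set only principal ultrafilters can occur; non-principal ones need an infinite domain.

```python
from itertools import combinations

def powerset(omega):
    """All subsets of omega, as frozensets."""
    items = sorted(omega)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def is_ultrafilter(family, omega):
    """Check the definition: non-empty members only, closed under finite
    intersections, and for every A either A or its complement belongs."""
    if frozenset() in family:
        return False
    if any((a & b) not in family for a in family for b in family):
        return False
    return all(a in family or (omega - a) in family for a in powerset(omega))

def all_ultrafilters(omega):
    """Brute-force search over all families of subsets of a small set."""
    subsets = powerset(omega)
    found = []
    for mask in range(1, 2 ** len(subsets)):
        family = frozenset(s for i, s in enumerate(subsets) if (mask >> i) & 1)
        if is_ultrafilter(family, omega):
            found.append(family)
    return found

def indicator(xi):
    """The indicator function of the ultrafilter xi on the power set."""
    return lambda a: 1 if a in xi else 0

OMEGA = frozenset({0, 1, 2})
ULTRAFILTERS = all_ultrafilters(OMEGA)

if __name__ == "__main__":
    for xi in ULTRAFILTERS:
        print(sorted(sorted(a) for a in xi))
```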

The book [l(| is a standard reference to the above topics. 

Given a subset $\mathscr C$ of a Boolean algebra $B$, and a subset $X$ of the Stone space $S(B)$, one can regard $\mathscr C$ as 
a set of binary functions restricted to $X$, and compute the VC dimension of $\mathscr C$ over $X$. We will denote this 
parameter $\mathrm{VC}(\mathscr C \restriction X)$. 

A subset $I$ of a Boolean algebra $B$ is an ideal if, whenever $x, y \in I$ and $a \in B$, one has $x \vee y \in I$ and 
$a \wedge x \in I$. Define a symmetric difference on $B$ by the formula $x \,\Delta\, y = (x \vee y) \wedge \neg(x \wedge y)$. The quotient Boolean 
algebra $B/I$ consists of all equivalence classes modulo the equivalence relation $x \sim y \iff x \,\Delta\, y \in I$. It can 
be easily verified to be a Boolean algebra on its own, with operations induced from $B$ in a unique way. 

The Stone space of $B/I$ can be identified with a compact topological subspace of $S(B)$, consisting of all 
homomorphisms $B \to \{0, 1\}$ whose kernel contains $I$. For instance, if $B = 2^\Omega$ and $I$ is an ideal of subsets 
of $\Omega$, then the Stone space of $2^\Omega/I$ is easily seen to consist of all ultrafilters on $\Omega$ which do not contain sets 
from $I$. 

Theorem 5.1. Let $\mathscr C$ be a concept class consisting of measurable subsets of a measurable domain $\Omega$, 
and let $I$ be an ideal of sets on $\Omega$. The following conditions are equivalent. 

1. The VC dimension of the (family of closures of the) concept class $\mathscr C$ restricted to the Stone space of 
the quotient algebra $2^\Omega/I$ is at least $n$: $\mathrm{VC}(\mathscr C \restriction S(2^\Omega/I)) \ge n$. 

2. There exists a family $A_1, A_2, \ldots, A_n$ of subsets of $\Omega$ not belonging to $I$, which is shattered by $\mathscr C$ in the 
sense that if $J \subseteq \{1, 2, \ldots, n\}$, then there is $C \in \mathscr C$ which contains all sets $A_i$, $i \in J$, and is disjoint 
from all sets $A_i$, $i \notin J$. In addition, the subsets $A_i$ can be assumed measurable. 

Proof. (1) $\Rightarrow$ (2). Choose ultrafilters $\xi_1, \ldots, \xi_n$ in the Stone space of the Boolean algebra $2^\Omega/I$, whose 
collection is shattered by $\mathscr C$. For every $J \subseteq \{1, 2, \ldots, n\}$, select $C_J \in \mathscr C$ which carves the subset $\{\xi_i \colon i \in J\}$ out 
of $\{\xi_1, \ldots, \xi_n\}$. This means $C_J \in \xi_i$ if and only if $i \in J$. For all $i = 1, 2, \ldots, n$, set 

$$A_i = \bigcap_{J \ni i} C_J \cap \bigcap_{J \not\ni i} C_J^c. \tag{5.1}$$

Then $A_i \in \xi_i$ and hence $A_i \notin I$. Furthermore, if $i \in J$, then clearly $A_i \subseteq C_J$, and if $i \notin J$, then $A_i \cap C_J = \emptyset$. 
The sets $A_i$ are measurable by their definition. 

(2) $\Rightarrow$ (1). Let $A_1, A_2, \ldots, A_n$ be a family of subsets of $\Omega$ not belonging to the set ideal $I$ and shattered 
by $\mathscr C$ in the sense of the theorem. For every $i$, the family of sets of the form $A_i \cap B^c$, $B \in I$, is a filter base and so is 
contained in some ultrafilter $\xi_i$, which is clearly disjoint from $I$ and contains $A_i$. If $J \subseteq \{1, 2, \ldots, n\}$ and 
$C_J \in \mathscr C$ contains all sets $A_i$, $i \in J$, and is disjoint from all sets $A_i$, $i \notin J$, then the closure $\bar C_J$ of $C_J$ in the 
Stone space contains $\xi_i$ if and only if $i \in J$. We conclude: the collection of ultrafilters $\xi_i$, $i = 1, 2, \ldots, n$, 
which are all contained in the Stone space of $2^\Omega/I$, is shattered by the closed sets $\bar C_J$. □ 
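
Formula (5.1) is purely set-theoretic, so it can be tried out on a small finite example. The sketch below is a toy under stated assumptions: the function `cells`, the domain `OMEGA` and the dictionary `CARVING` (mapping each index set J to a concept C_J) are hypothetical names chosen for illustration. It computes the sets A_i and lets one confirm that A_i lies inside C_J when i is in J and misses C_J otherwise.

```python
def cells(omega, carving):
    """Given a dict J -> C_J (J a frozenset of indices, C_J a subset of omega),
    return the dict i -> A_i, where A_i is the intersection of all C_J with
    i in J and of all complements of C_J with i not in J, as in (5.1)."""
    indices = set().union(*carving.keys())
    result = {}
    for i in indices:
        cell = set(omega)
        for J, C in carving.items():
            cell &= C if i in J else (omega - C)
        result[i] = frozenset(cell)
    return result

# Toy data (assumed for illustration): two "clusters" {0,1,2} and {3,4,5},
# and four concepts that also contain some irrelevant extra points.
OMEGA = frozenset(range(12))
CARVING = {
    frozenset(): frozenset({6, 7}),
    frozenset({1}): frozenset({0, 1, 2, 8}),
    frozenset({2}): frozenset({3, 4, 5, 9}),
    frozenset({1, 2}): frozenset({0, 1, 2, 3, 4, 5, 10}),
}
CELLS = cells(OMEGA, CARVING)

if __name__ == "__main__":
    for i, cell in sorted(CELLS.items()):
        print(i, sorted(cell))
```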

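It follows in particular that the VC dimension of a concept class does not change if the domain $\Omega$ is 
compactified. 

Corollary 5.2. $\mathrm{VC}(\mathscr C \restriction \Omega) = \mathrm{VC}(\mathscr C \restriction \beta\Omega)$. 

Proof. The inequality $\mathrm{VC}(\mathscr C \restriction \Omega) \le \mathrm{VC}(\mathscr C \restriction \beta\Omega)$ is trivial. To establish the converse, assume there is a 
subset of $\beta\Omega$ of cardinality $n$ shattered by $\mathscr C$. Choose sets $A_i$ as in Theorem 5.1(2), applied to the trivial ideal $I = \{\emptyset\}$. Clearly, any subset of 
$\Omega$ meeting each $A_i$ at exactly one point is shattered by $\mathscr C$. □ 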

Definition 5.3. Given a concept class $\mathscr C$ on a domain $\Omega$ and an ideal $I$ of subsets of $\Omega$, we define the VC 
dimension of $\mathscr C$ modulo $I$, 

$$\mathrm{VC}(\mathscr C \bmod I) = \mathrm{VC}(\mathscr C \restriction S(2^\Omega/I)).$$

That is, $\mathrm{VC}(\mathscr C \bmod I) \ge n$ if and only if any of the equivalent conditions of Theorem 5.1 is met. 

Definition 5.4. Let $\mathscr C$ be a concept class on a domain $\Omega$. If $I$ is the ideal of all countable subsets of $\Omega$, we 
denote $\mathrm{VC}(\mathscr C \bmod I)$ by $\mathrm{VC}(\mathscr C \bmod \omega_1)$ and call it the VC dimension modulo countable sets. 

Now Theorem 5.1 validates the definition of VC dimension modulo countable sets in the form stated in 
the Introduction to our article. 

6. Fat-shattering dimension modulo countable sets 

When dealing with real-valued functions instead of subsets of the domain, the role of Boolean algebras 
is taken over by commutative C*-algebras. Here is a brief summary. See e.g. [2] for more. 

Recall that a C*-algebra is an associative algebra over the field of complex numbers $\mathbb C$ equipped with an 
involution (an anti-linear map $x \mapsto x^*$) and a complete norm which is submultiplicative ($\|xy\| \le \|x\| \|y\|$) and satisfies 
the property $\|x^* x\| = \|x\|^2$. For instance, the family $C(X)$ of all continuous complex-valued functions on a 
compact topological space $X$ forms a commutative unital C*-algebra. Conversely, every commutative unital 
C*-algebra $A$ is of this form. The space $X$, called the Gelfand space, or the maximal ideal space of $A$, is 
uniquely defined. Its elements can be described as non-zero multiplicative complex linear functionals on 
$A$. The topology on the space of such functionals is the weak star (weak*) topology, that is, the coarsest 
topology making every evaluation map $\varphi \mapsto \varphi(a)$, $a \in A$, continuous. 

We want to calculate the maximal ideal space of the C*-algebra $\ell^\infty(\Omega)$ of all bounded complex-valued 
functions on a set $\Omega$. With this purpose, we introduce the following notion. 

Given a bounded scalar-valued function $f$ on a set $\Omega$ and an ultrafilter $\xi$ on $\Omega$, the limit of $f$ along the 
ultrafilter $\xi$ is a uniquely defined number, $y$, with the property that for each $\varepsilon > 0$, 

$$\{x \in \Omega \colon |f(x) - y| < \varepsilon\} \in \xi. \tag{6.1}$$

The limit along an ultrafilter, or an ultralimit, for short, is denoted $\lim_{x \to \xi} f(x)$. Unlike the usual limit, the 
ultralimit of a bounded function along a fixed ultrafilter always exists; the proof mimics the 
classical Heine-Borel compactness argument for the closed interval. This observation makes the ultralimit 
a very powerful tool. Its downside is a highly non-constructive nature: typically, the value of an ultralimit 
of a particular function cannot be computed explicitly except in the "uninteresting" situations where it 
coincides with the usual limit. 

The correspondence $\xi \mapsto \lim_{x \to \xi} f(x)$ defines a continuous function $\bar f$ on $\beta\Omega$, which is the unique continuous 
extension of $f$ over the Stone-Čech compactification $\beta\Omega$. Here, as is usual in set-theoretic topology and 
analysis, we identify every point $x$ of $\Omega$ with the corresponding principal (trivial) ultrafilter, $\bar x$, consisting of 
all subsets of $\Omega$ which contain $x$ as an element. 

If an ultrafilter $\xi$ is fixed, then the correspondence $f \mapsto \bar f(\xi)$ is a linear multiplicative functional of norm 
one on $\ell^\infty(\Omega)$, sending the function $1$ to $1$. It turns out that every linear multiplicative functional $\phi$ of norm 
one on $\ell^\infty(\Omega)$ sending $1$ to $1$ is of this form, that is, is the ultralimit along some ultrafilter on $\Omega$. This is, 
in fact, a rather simple observation: it suffices to restrict $\phi$ to the set of all $\{0, 1\}$-valued functions on $\Omega$ and 
notice that the image of every such function is necessarily either $0$ or $1$; the family $\xi$ of all sets $A \subseteq \Omega$ with 
$\phi(\chi_A) = 1$ is now seen to be an ultrafilter, and an approximation argument with finite linear combinations 
shows that for every $f \in \ell^\infty(\Omega)$ one must have $\phi(f) = \lim_{x \to \xi} f(x)$. In this way the maximal ideal space of 
$\ell^\infty(\Omega)$ is identified with the space of ultrafilters $\beta\Omega$, that is, the Stone-Čech compactification of $\Omega$. Thus, 
the C*-algebras $\ell^\infty(\Omega)$ and $C(\beta\Omega)$ are isomorphic. An isomorphism is given by the map $f \mapsto \bar f$, where $\bar f$ is 
the unique continuous extension of $f$ over $\beta\Omega$ mentioned above. 

Given a C*-algebra $A$, an ideal $I$ of $A$ is a closed linear subspace stable under multiplication by elements 
of $A$. The quotient algebra $A/I$ is again a C*-algebra (which is in general not an easy fact to prove). If $A$ is 
a commutative unital C*-algebra and $I$ is a non-trivial ideal ($I \ne A$), then $A/I$ is isomorphic to an algebra 
of continuous functions on a suitable closed subspace $Y$ of the maximal ideal space $X$ of $A$. A functional 
$x \in X$ belongs to $Y$ if and only if it factors through the quotient map $\pi \colon A \to A/I$, that is, the kernel of 
$x \colon A \to \mathbb C$ contains $I$. 

Conversely, every compact subspace of $X$ determines an ideal of $C(X)$. 

A link with the Boolean algebra setting is provided by the following observation: every ideal $I$ of subsets 
of $\Omega$ generates an ideal $\mathcal I$ of the C*-algebra $\ell^\infty(\Omega)$, as the smallest ideal of $\ell^\infty(\Omega)$ containing characteristic 
functions of all elements of $I$. Now one can verify without difficulty that the maximal ideal space of the 
C*-algebra $\ell^\infty(\Omega)/\mathcal I$ is the Stone space of the Boolean algebra $2^\Omega/I$. In fact, every ideal of $\ell^\infty(\Omega)$ is of this 
form. 

Definition 6.1. Let $A$ be a commutative unital C*-algebra, $\mathscr F$ a subset of $A$, and $I$ an ideal of $A$. For every 
$\varepsilon > 0$, define the $\varepsilon$-fat shattering dimension of $\mathscr F$ modulo $I$, denoted $\mathrm{fat}_\varepsilon(\mathscr F \bmod I)$, as the $\varepsilon$-fat shattering 
dimension of $\mathscr F$ viewed as a function class on the maximal ideal space $Y$ of $A/I$. 

In a more detailed way, we denote $\pi \colon A \to A/I$ the quotient homomorphism. A finite set $B \subseteq Y$ is $\varepsilon$-fat 
shattered by $\mathscr F$ if for some function $h \colon B \to [0, 1]$ and every $C \subseteq B$ there is $f_C \in \mathscr F$ with 

$$\begin{cases} y(\pi(f_C)) > h(y) + \varepsilon, & y \in C, \\ y(\pi(f_C)) < h(y) - \varepsilon, & y \notin C. \end{cases}$$

Here elements $y \in Y$ are treated as functionals on $A/I$. The $\varepsilon$-fat shattering dimension of $\mathscr F$ modulo $I$, 
denoted $\mathrm{fat}_\varepsilon(\mathscr F \bmod I)$, is the supremum of cardinalities of finite subsets of the maximal ideal space of $A/I$ 
$\varepsilon$-fat shattered by $\mathscr F$. 
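
For a finite set of points and a finite family of functions, the shattering condition in this definition is a finite check. The sketch below is a toy illustration, not the paper's construction: `is_fat_shattered` is a hypothetical helper, functions are represented as dictionaries from points to values, the witness is given explicitly, and the strict inequalities are an assumption about how the garbled source reads. Exact rationals are used so that boundary cases are decided without rounding.

```python
from fractions import Fraction
from itertools import combinations

def is_fat_shattered(points, functions, witness, eps):
    """Return True if for every subset C of points some function f satisfies
    f(y) > witness(y) + eps for y in C and f(y) < witness(y) - eps otherwise."""
    pts = list(points)
    for r in range(len(pts) + 1):
        for chosen in combinations(pts, r):
            chosen = set(chosen)
            realised = any(
                all(
                    f[y] > witness[y] + eps if y in chosen
                    else f[y] < witness[y] - eps
                    for y in pts
                )
                for f in functions
            )
            if not realised:
                return False
    return True

# Toy data (assumed): four functions on two points taking values 1/10 or 9/10.
HI, LO = Fraction(9, 10), Fraction(1, 10)
FUNCTIONS = [
    {"a": LO, "b": LO},
    {"a": HI, "b": LO},
    {"a": LO, "b": HI},
    {"a": HI, "b": HI},
]
WITNESS = {"a": Fraction(1, 2), "b": Fraction(1, 2)}

if __name__ == "__main__":
    print(is_fat_shattered(["a", "b"], FUNCTIONS, WITNESS, Fraction(1, 4)))
    print(is_fat_shattered(["a", "b"], FUNCTIONS, WITNESS, Fraction(2, 5)))
```

With the witness 1/2 the margin available in this toy family is exactly 2/5, so the first call succeeds while the second fails under strict inequalities.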
Definition 6.2. Let $\mathscr F$ be a function class on a domain $\Omega$, and let $\varepsilon > 0$. We call the $\varepsilon$-fat shattering 
dimension of $\mathscr F$ modulo countable sets the value $\mathrm{fat}_\varepsilon(\mathscr F \bmod I)$, where $I$ is the C*-algebra ideal of $\ell^\infty(\Omega)$ 
generated by characteristic functions of countable sets. We denote this value $\mathrm{fat}_\varepsilon(\mathscr F \bmod \omega_1)$. 

Now we reformulate Definition 6.2 avoiding the C*-algebraic terminology. Let $\beta_{\omega_1}\Omega$ denote the collection 
of all points of $\beta\Omega$ which, viewed as ultrafilters on $\Omega$, only contain uncountable sets. The $\varepsilon$-fat shattering 
dimension of $\mathscr F$ modulo countable sets is the usual $\varepsilon$-fat shattering dimension of the class of functions $f \in \mathscr F$ 
extended over $\beta\Omega$ by continuity and then restricted to $\beta_{\omega_1}\Omega$. 

We have an analogue of Theorem 5.1. 

Theorem 6.3. Let $\mathscr F$ be a class of measurable functions on a standard Borel domain $\Omega$, and let $I$ be an 
ideal of the C*-algebra $\ell^\infty(\Omega)$. Fix any $\varepsilon > 0$. The following are equivalent. 

1. The $\varepsilon$-fat shattering dimension of $\mathscr F$ modulo $I$ is at least $n$. 

2. There exists a family $A_1, A_2, \ldots, A_n$ of measurable subsets of $\Omega$ whose indicator functions do not belong 
to $I$, which is $\varepsilon$-fat shattered by $\mathscr F$ in the following sense: there is a witness function $h \colon \{1, 2, \ldots, n\} \to 
[0, 1]$ and for each $J \subseteq \{1, 2, \ldots, n\}$ there is an $f_J \in \mathscr F$ such that 

$$\begin{aligned} (i \in J \wedge x \in A_i) &\Rightarrow f_J(x) > h(i) + \varepsilon, \\ (i \notin J \wedge x \in A_i) &\Rightarrow f_J(x) < h(i) - \varepsilon. \end{aligned} \tag{6.2}$$

Proof. Before proceeding to the argument, let us recall that ultrafilters on $\Omega$ are viewed sometimes as 
mere points of the Stone-Čech compactification $\beta\Omega$, and sometimes as families of subsets of $\Omega$. Every point 
$x \in \Omega$ is canonically identified with the corresponding principal ultrafilter $\bar x$, and every bounded function 
$f$ on $\Omega$ admits a canonical continuous extension $\bar f$ over $\beta\Omega$ via the rule $\bar f(\xi) = \lim_{x \to \xi} f(x)$. Notice that this 
definition implies $\bar f(\bar x) = f(x)$ whenever $x \in \Omega$. 

(1) $\Rightarrow$ (2). Let $Y \subseteq \beta\Omega$ denote the maximal ideal space of the C*-algebra $\ell^\infty(\Omega)/I$. In other words, 
$\ell^\infty(\Omega)/I = C(Y)$. There exist $n$ elements of $Y$ which are $\varepsilon$-fat shattered by $\mathscr F$, let us say $\xi_1, \ldots, \xi_n$. 
Recall that these are ultrafilters on $\Omega$, that is, families of subsets of the domain. Choose a witness function 
$h \colon \{1, 2, \ldots, n\} \to [0, 1]$, and select for every $J \subseteq \{1, 2, \ldots, n\}$ a function $f_J \in \mathscr F$ whose ultralimit along $\xi_i$ 
is $> h(i) + \varepsilon$ if $i \in J$, and is $< h(i) - \varepsilon$ otherwise. For all $i = 1, 2, \ldots, n$, denote by 

$$\tilde A_i = \bigcap_{J \ni i} \{\xi \in \beta\Omega \colon \bar f_J(\xi) > h(i) + \varepsilon\} \cap \bigcap_{J \not\ni i} \{\xi \in \beta\Omega \colon \bar f_J(\xi) < h(i) - \varepsilon\}, \tag{6.3}$$

and consider $A_i = \tilde A_i \cap \Omega$. For every $i$ one has $\xi_i \in \tilde A_i$ by the choice of the functions $f_J$. Since the value 
$\bar f_J(\xi_i)$ is the ultralimit of $f_J$ along $\xi_i$, it follows from the definition of an ultralimit (6.1) that the trace on $\Omega$ of each of the $2^n$ 
sets appearing in Eq. (6.3) belongs to $\xi_i$, and since $\xi_i$ is closed under finite intersections, one has $A_i \in \xi_i$. 
Equivalently, $\bar\chi_{A_i}(\xi_i) = 1$, which implies that $\chi_{A_i} \notin I$ (as every function in the ideal $I$, or, a bit more 
precisely, its unique continuous extension over $\beta\Omega$, identically vanishes on $Y$). Since the functions $f_J$ are 
measurable with regard to the Borel structure on $\Omega$, so are the sets $A_i$. The condition (2) is verified by the 
definition of the sets $A_i$. 

(2) $\Rightarrow$ (1). Let $A_1, A_2, \ldots, A_n$ be a family of subsets of $\Omega$ satisfying (2). Their topological closures $\bar A_i$ 
taken in $\beta\Omega$ satisfy 

$$\begin{aligned} (i \in J \wedge \xi \in \bar A_i) &\Rightarrow \bar f_J(\xi) > h(i) + \varepsilon, \\ (i \notin J \wedge \xi \in \bar A_i) &\Rightarrow \bar f_J(\xi) < h(i) - \varepsilon. \end{aligned}$$

The condition $\chi_{A_i} \notin I$ can be reformulated as $\bar A_i \cap Y \ne \emptyset$. Choose $\xi_i \in \bar A_i \cap Y$ for every $i = 1, 2, \ldots, n$. The 
set $\{\xi_i\}_{i=1}^n$ is $\varepsilon$-fat shattered by the functions $\bar f$, $f \in \mathscr F$, with the witness function $\xi_i \mapsto h(i)$. □ 

Remark 6.4. Note that we have not used the assumption of measurability of subsets $A_i$ in the proof of the 
implication (2) $\Rightarrow$ (1). 

Corollary 6.5. Let $\mathscr F$ be a class of $[0, 1]$-valued functions on $\Omega$ and let $\varepsilon > 0$. The $\varepsilon$-fat shattering dimension 
of $\mathscr F$ equals the $\varepsilon$-fat shattering dimension of the set of functions $\bar f$, $f \in \mathscr F$, on $\beta\Omega$. □ 

Corollary 6.6. Let $\mathscr F$ be a class of $[0, 1]$-valued functions on $\Omega$ and let $\varepsilon > 0$. The $\varepsilon$-fat shattering dimension 
of $\mathscr F$ modulo countable sets is the supremum of cardinalities of finite families $A_1, A_2, \ldots, A_n$ of uncountable 
subsets of $\Omega$ which are $\varepsilon$-fat shattered by $\mathscr F$ in the sense of Condition (6.2) with a suitable witness function 
$h \colon \{1, 2, \ldots, n\} \to [0, 1]$. □ 



7. Finiteness of combinatorial dimension modulo countable sets as a necessary condition 

In this Section, we remark that, similarly to the classical case of distribution-free learning, finiteness of 
VC dimension modulo countable sets is necessary for PAC learnability of a concept class under non-atomic 
measures, but this is not the case for fat shattering dimension of a function class. 

Lemma 7.1. Every uncountable Borel subset of a standard Borel space supports a non-atomic Borel prob- 
ability measure. 

Proof. Let $A$ be an uncountable Borel subset of a standard Borel space $\Omega$, that is, $\Omega$ is a Polish space 
equipped with its Borel structure. According to Souslin's theorem (see e.g. Theorem 3.2.1 in [2]), there 
exists a Polish (complete separable metric) space $X$ and a continuous one-to-one mapping $f \colon X \to A$ of $X$ onto $A$. The 
Polish space $X$ must be therefore uncountable, and so supports a non-atomic probability measure, $\nu$. The 
direct image measure $f_*\nu$, given by $(f_*\nu)(B) = \nu(f^{-1}(B))$, is a Borel probability measure on $\Omega$ supported on $A$, and it is 
non-atomic because the inverse image of every singleton is at most a singleton in $X$ and thus has measure zero. □ 

The following result makes no measurability assumptions on the concept class. 

Theorem 7.2. Let $\mathscr C$ be a concept class on a domain $\Omega$ which is a standard Borel space. If $\mathscr C$ is PAC 
learnable under non-atomic measures, then the VC dimension of $\mathscr C$ modulo countable sets is finite. 

Proof. This is just a minor variation of a classical result for distribution-free PAC learnability (Theorem 
2.1(i) in [4]; we will follow the proof as presented in [20], Lemma 7.2 on p. 279). 

Suppose $\mathrm{VC}(\mathscr C \bmod \omega_1) \ge d$. According to Theorem 5.1 there is a family of uncountable Borel sets $A_i$, 
$i = 1, 2, \ldots, d$, shattered by $\mathscr C$ in our sense. Using Lemma 7.1, select for every $i = 1, 2, \ldots, d$ a non-atomic 
probability measure $\mu_i$ supported on $A_i$, and let $\mu = \frac{1}{d} \sum_{i=1}^d \mu_i$. This $\mu$ is a non-atomic Borel probability 
measure, giving each $A_i$ equal weight $1/d$. See Figure 3. 



[Figure 3: Construction of the measure $\mu$. Diffuse measures of mass $1/d$ are supported on each of the sets $A_1, A_2, \ldots, A_d$.] 



For every $d$-bit string $\sigma$ there is a concept $C_\sigma \in \mathscr C$ which contains all $A_i$ with $\sigma_i = 1$ and is disjoint from 
$A_i$ with $\sigma_i = 0$. If $A$ and $B$ take constant values on all the sets $A_i$, $i = 1, 2, \ldots, d$, then $d_\mu(A, B)$ is just the 
normalized Hamming distance between the corresponding $d$-bit strings. Now, given $A = C_\sigma$ and $0 \le k \le d$, 
there are 

$$\sum_{k < 2\varepsilon d} \binom{d}{k}$$

concepts $B$ of the form $C_\tau$ with $d_\mu(A, B) < 2\varepsilon$. This allows to get the following lower bound on the number of pairwise 
$2\varepsilon$-separated concepts: 

$$\frac{2^d}{\sum_{k < 2\varepsilon d} \binom{d}{k}}.$$

The Chernoff-Okamoto bound allows to estimate the above expression from below by $\exp[2(0.5 - 2\varepsilon)^2 d]$. 
We conclude: the metric entropy of $\mathscr C$ with regard to $\mu$ is bounded from below by 

$$M(2\varepsilon, \mathscr C, \mu) \ge \exp[2(0.5 - 2\varepsilon)^2 d].$$

The assumption $\mathrm{VC}(\mathscr C \bmod \omega_1) = \infty$ now implies that for every $0 < \varepsilon < 0.25$, 

$$\sup_{\mu \in \mathcal P} M(2\varepsilon, \mathscr C, \mu) = \infty,$$

where $\mathcal P$ denotes the family of all non-atomic measures on $\Omega$. By Lemma 7.1 in [20], p. 278, the class $\mathscr C$ is 
not PAC learnable under $\mathcal P$. □ 
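
The counting estimate used in this proof is easy to check numerically. The sketch below is an illustration only, with hypothetical function names: it evaluates the ratio of 2^d to the size of a Hamming ball of radius below 2εd and compares it with the exponential lower bound obtained from the Chernoff-Okamoto (Hoeffding) inequality for a symmetric binomial distribution.

```python
from math import comb, exp

def separated_lower_bound(d, eps):
    """2^d divided by the number of d-bit strings at Hamming distance
    k < 2*eps*d from a fixed string."""
    ball = sum(comb(d, k) for k in range(d + 1) if k < 2 * eps * d)
    return 2 ** d / ball

def chernoff_okamoto_estimate(d, eps):
    """The exponential lower bound exp[2 (0.5 - 2 eps)^2 d]."""
    return exp(2 * (0.5 - 2 * eps) ** 2 * d)

if __name__ == "__main__":
    for d in (10, 50, 200):
        for eps in (0.05, 0.1, 0.2):
            print(d, eps, separated_lower_bound(d, eps),
                  chernoff_okamoto_estimate(d, eps))
```

For fixed ε below 0.25 both quantities grow exponentially in d, which is what forces the metric entropy to be unbounded when the dimension modulo countable sets is infinite.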

On the contrary, a function class $\mathscr F$ can be PAC learnable under non-atomic measures and still have an 
infinite fat-shattering dimension modulo countable sets. The following is an adaptation of Example 2.10 in 

0- 

Example 7.3. For a given $n \in \mathbb N$, call any interval of the form $[i/n, (i+1)/n]$, $i = 0, 1, \ldots, n-1$, an interval 
of order $n$. Form the class $\mathscr C_n$ consisting of all unions of less than $\sqrt n$ intervals of order $n$. Let $\mathscr C$ be the 
union of classes $\mathscr C_n$, $n \in \mathbb N$. Now we will transform $\mathscr C$ into a function class. With this purpose, establish a 
bijection $i$ between $\mathscr C$ and the rational points of the interval $[0, 1/3]$. Let $\mathscr F$ consist of all functions of the 
form $f_C$, $C \in \mathscr C$, where 

$$f_C(x) = \chi_C(x) + (-1)^{\chi_C(x)} i(C).$$

Each function $f_C$ takes its (rational) values in $[0, 1/3] \cup [2/3, 1]$ and is uniquely identifiable by its value at 
any single point $x \in [0, 1]$. For this reason, the class $\mathscr F$ is (exactly) learnable. A learning rule is given, for 
instance, by $\mathcal L(x, r) = f_C$ with $C = i^{-1}(\min\{r, 1 - r\})$, where $(x, r)$ is a learning 1-sample. 

At the same time, $\mathrm{fat}_{1/6}(\mathscr F \bmod \omega_1) = \infty$. Indeed, given any $k \in \mathbb N$, an arbitrary collection $I_1, I_2, \ldots, I_k$ 
of $k$ pairwise distinct intervals of order $n = k^3$ is $1/6$-shattered by the functions $f_C$, $C \in \mathscr C_n$, with the witness 
function taking a constant value $1/2$. 

This example can be further modified. For instance, one can consider a larger class $\tilde{\mathscr F}$ consisting of all 
functions $f$ for which there exists a $g \in \mathscr F$ with $\{x \colon f(x) \ne g(x)\}$ being a universal null set. The class $\tilde{\mathscr F}$ is 
probably exactly learnable by the same learning rule $\mathcal L$ as above. 
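
The decoding trick in this example can be tried out on a finite fragment of the class. The sketch below is only a toy under stated assumptions: the dictionary `CODE` stands in for the bijection between concepts and rationals in [0, 1/3] (here just five concepts, each a set of indices of intervals of order `N`), and the names `f`, `learn` and `in_concept` are hypothetical. Exact rational arithmetic makes the recovery of the concept from a single labelled point exact.

```python
from fractions import Fraction

N = 4  # intervals of order N are [i/N, (i+1)/N], i = 0, ..., N-1

# Finite stand-in (an assumption) for the bijection between concepts and
# rational points of [0, 1/3]; a concept is a set of interval indices.
CODE = {
    frozenset(): Fraction(0),
    frozenset({0}): Fraction(1, 12),
    frozenset({1}): Fraction(1, 6),
    frozenset({0, 1}): Fraction(1, 4),
    frozenset({2, 3}): Fraction(1, 3),
}
DECODE = {q: C for C, q in CODE.items()}

def in_concept(C, x):
    """Membership of x in the union of the intervals of order N indexed by C."""
    return any(Fraction(i, N) <= x <= Fraction(i + 1, N) for i in C)

def f(C, x):
    """The function f_C: the code of C off the concept, one minus it on the concept."""
    chi = 1 if in_concept(C, x) else 0
    return chi + (-1) ** chi * CODE[C]

def learn(x, r):
    """Recover the concept from a single labelled point (x, r)."""
    return DECODE[min(r, 1 - r)]

if __name__ == "__main__":
    x = Fraction(3, 8)
    for C in CODE:
        print(sorted(C), f(C, x), sorted(learn(x, f(C, x))))
```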



8. The universally separable case 

In this Section we will express our versions of the combinatorial dimension modulo countable sets in 
terms of the corresponding classical notions. Namely, we will prove that $\mathrm{VC}(\mathscr C \bmod \omega_1) \le d$ if and only if 
every countable subclass of $\mathscr C$ has VC dimension $\le d$ outside of a suitable countable set, and similarly for fat 
shattering dimension. 

Lemma 8.1. Let $\mathscr F$ be a universally separable function class, with a universally dense countable subset $\mathscr F'$. 
Then for every $\varepsilon > 0$ 

$$\mathrm{fat}_\varepsilon(\mathscr F) = \mathrm{fat}_\varepsilon(\mathscr F').$$

Proof. For every $f \in \mathscr F$ there is a sequence $(f_n)$ of elements of $\mathscr F'$ which converges to $f$ pointwise: given 
a finite $A \subseteq \Omega$ and a $\gamma > 0$, there is an $N$ such that whenever $n \ge N$, one has $|f(x) - f_n(x)| < \gamma$ for all 
$x \in A$. This means that if $A$ is $\varepsilon$-fat shattered by $\mathscr F$, it is equally well shattered by $\mathscr F'$, with the same witness 
function. This observation establishes the inequality $\mathrm{fat}_\varepsilon(\mathscr F) \le \mathrm{fat}_\varepsilon(\mathscr F')$, while the converse inequality is 
trivially true. □ 

Since for a concept class $\mathscr C$ one has $\mathrm{VC}(\mathscr C) = \mathrm{fat}_\varepsilon(\mathscr C)$ whenever $\varepsilon < 1/2$, we obtain: 

Corollary 8.2. Let $\mathscr C$ be a universally separable concept class, and let $\mathscr C'$ be a universally dense countable 
subset of $\mathscr C$. Then 

$$\mathrm{VC}(\mathscr C) = \mathrm{VC}(\mathscr C').$$

While a version of the following result for fat shattering dimension covers the VC dimension as a particular 
case, the proof is technically more complicated, and we feel that the complications obscure the simple idea 
of the proof for VC dimension. For this reason, we give a separate presentation for VC dimension first. 

Theorem 8.3. For a universally separable concept class $\mathscr C$, the following conditions are equivalent. 

1. $\mathrm{VC}(\mathscr C \bmod \omega_1) \le d$. 

2. There exists a countable subset $A \subseteq \Omega$ such that $\mathrm{VC}(\mathscr C \restriction (\Omega \setminus A)) \le d$. 

Proof. (1) $\Rightarrow$ (2): Choose a countable universally dense subfamily $\mathscr C'$ of $\mathscr C$. Let $\mathscr B$ be the smallest Boolean 
algebra of subsets of $\Omega$ containing $\mathscr C'$. Denote by $A$ the union of all elements of $\mathscr B$ that are countable sets. 
Clearly, $\mathscr B$ is countable, and so $A$ is a countable set. 

Let a finite set $B \subseteq \Omega \setminus A$ be shattered by $\mathscr C$. Then, by Corollary 8.2, it is shattered by $\mathscr C'$. Select a 
family $\mathscr S$ of $2^{|B|}$ sets in $\mathscr C'$ shattering $B$. For every $b \in B$ the set 

$$[b] = \bigcap_{b \in C \in \mathscr S} C \cap \bigcap_{b \notin C \in \mathscr S} C^c$$

is uncountable (for it belongs to $\mathscr B$ yet is not contained in $A$), and the collection of sets $[b]$, $b \in B$, is shattered 
by $\mathscr C'$. According to (1), $|B| \le d$, from which we deduce (2). Notice that this establishes the inequality 
$\mathrm{VC}(\mathscr C \restriction (\Omega \setminus A)) \le \mathrm{VC}(\mathscr C \bmod \omega_1)$. 

(2) $\Rightarrow$ (1): Fix a countable $A \subseteq \Omega$ such that $\mathrm{VC}(\mathscr C \restriction A^c) \le d$. Suppose a collection of $n$ uncountable sets 
$A_i$, $i = 1, 2, \ldots, n$, is shattered by $\mathscr C$ in our sense. The sets $A_i \setminus A$ are non-empty; pick a representative 
$a_i \in A_i \setminus A$, $i = 1, 2, \ldots, n$. The resulting set $\{a_i\}_{i=1}^n$ is shattered by $\mathscr C$, meaning $n \le d$. □ 
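
The phenomenon behind this theorem, a class whose VC dimension drops once a small exceptional set is removed from the domain, already shows up in a finite toy. The sketch below is only an analogy under stated assumptions: a finite set plays the role of the countable exceptional set, and `vc_dimension`, `CONCEPTS` and `EXCEPTIONAL` are hypothetical names. The class consists of all subsets of a three-point set together with initial segments of the remaining points.

```python
from itertools import combinations

def vc_dimension(concepts, domain):
    """Brute-force VC dimension of a finite family of sets restricted to a
    finite domain (largest size of a shattered subset of the domain)."""
    domain = list(domain)
    best = 0
    for r in range(1, len(domain) + 1):
        shattered = False
        for pts in combinations(domain, r):
            traces = {c & frozenset(pts) for c in concepts}
            if len(traces) == 2 ** r:
                shattered = True
                break
        if not shattered:
            break  # subsets of shattered sets are shattered, so we can stop
        best = r
    return best

EXCEPTIONAL = frozenset({0, 1, 2})
# All subsets of the exceptional set ...
CONCEPTS = [frozenset(c) for r in range(4)
            for c in combinations(sorted(EXCEPTIONAL), r)]
# ... together with the initial segments {3, ..., t} of the remaining points.
CONCEPTS += [frozenset(range(3, t + 1)) for t in range(3, 9)]

if __name__ == "__main__":
    print(vc_dimension(CONCEPTS, range(9)))
    print(vc_dimension(CONCEPTS, range(3, 9)))
```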

Now a version for fat shattering dimension. 

Theorem 8.4. For a universally separable function class & and e > 0, the following conditions are equiv- 
alent. 

1. fat £ mod ojx) < d. 

2. There exists a countable subset A C fi such that fat e (^" \ A)) < d. 

For a universally separable function class and e > 0, the conditions are equivalent. 

Proof. ([U^C]): For a function / on £1 and rfR, denote 

[/ < r] = {x e Q: f{x) < r} and [/ > r] = {x € fi: f(x) > r}. 

Let be a countable universally dense subfamily of & " . Denote by 38 the smallest algebra of subsets of ft 
containing all sets [/ < r], [/ > r] for / £ and r E Q. Now denote by A the union of all elements of 38 
that are countable sets. Since 38 is countable, so is A. 

Let a finite set B C fl \ A be e-fat shattered by Then, by Lemma [8.11 it is shattered by J^', and 
by Lemma 12.11 there is a rational e' > e and a rational- valued function h: B — > Q such that B is e'-fat 
shattered by a family 5? of 2 I s ' functions in with h as a witness function. 

For every b 6 B form the set 

[b] = {x e n : VC C B, beC^> f c (x) > h(b) + e' A 

b^C^fc(x) < h(b)-e'}. 

The set $[b]$ belongs to the algebra of sets $\mathscr{B}$ and is not contained in $A$ (for instance, $b \in [b]$ and $b \notin A$).
Therefore, $[b]$ is uncountable. If $b, c \in B$ and $b \ne c$, then $[b] \cap [c] = \emptyset$. Finally, the collection of sets $[b]$, $b \in B$,
is $\epsilon'$-fat shattered by $\mathscr{F}'$ with $h$ as a witness function, hence $\epsilon$-fat shattered. Since $|B| \le d$ by (1), we have proved
(2), and established the inequality $\mathrm{fat}_\epsilon(\mathscr{F} \upharpoonright (\Omega \setminus A)) \le \mathrm{fat}_\epsilon(\mathscr{F} \bmod \omega_1)$.

(2)$\Rightarrow$(1): Fix a countable subset $A \subseteq \Omega$ such that $\mathrm{fat}_\epsilon(\mathscr{F} \upharpoonright A^c) \le d$. Suppose a collection of $n$
uncountable sets $A_i$, $i = 1, 2, \ldots, n$, is $\epsilon$-fat shattered by the function class $\mathscr{F}$. The sets $A_i \setminus A$ are
non-empty, so we can select a representative $a_i$ in each one of them, $i = 1, 2, \ldots, n$. The resulting set $\{a_i\}_{i=1}^n$ is
$\epsilon$-fat shattered by $\mathscr{F}$, meaning $n \le d$. □
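
For finite data, the fat-shattering condition used in the proof above can be checked mechanically. The following Python sketch is an illustration only, not part of the argument: it assumes the strict-inequality form of $\epsilon$-fat shattering that appears in the definition of $[b]$, and the names `is_fat_shattered`, `points`, `functions` and `witness` are hypothetical.

```python
from itertools import combinations, product

def is_fat_shattered(points, functions, witness, eps):
    """Check eps-fat shattering of `points` by `functions` with witness `witness`.

    Assumed convention (strict inequalities, as in the set [b] above):
    for every 0/1 labelling there must be some f with
    f(x) > witness[x] + eps on points labelled 1 and
    f(x) < witness[x] - eps on points labelled 0.
    """
    points = list(points)
    for labels in product([0, 1], repeat=len(points)):
        realised = any(
            all((f(x) > witness[x] + eps) if bit else (f(x) < witness[x] - eps)
                for x, bit in zip(points, labels))
            for f in functions
        )
        if not realised:
            return False
    return True

# Toy data: indicator functions f_C of all subsets C of a two-point set B.
points = ["a", "b"]
subsets = [frozenset(c) for r in range(3) for c in combinations(points, r)]
functions = [lambda x, C=C: 1.0 if x in C else 0.0 for C in subsets]
witness = {"a": 0.5, "b": 0.5}

print(is_fat_shattered(points, functions, witness, 0.4))   # True
print(is_fat_shattered(points, functions, witness, 0.25))  # True
print(is_fat_shattered(points, functions, witness, 0.5))   # False
```

The first two calls mirror the step "$\epsilon'$-fat shattered, hence $\epsilon$-fat shattered": shattering at the larger margin 0.4 with a given witness implies shattering at the smaller margin 0.25 with the same witness.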

Corollary 8.5. Let $\mathscr{C}$ be a universally separable concept class on a Borel domain $\Omega$. If $d = \mathrm{VC}(\mathscr{C} \bmod \omega_1) < \infty$,
then $\mathscr{C}$ is a uniform Glivenko-Cantelli class with respect to non-atomic measures and consistently PAC
learnable under non-atomic measures, with a standard sample complexity corresponding to $d$.

Proof. The class $\mathscr{C}$ has finite VC dimension in the complement to a suitable countable subset $A$ of $\Omega$, hence
$\mathscr{C}$ is a universal Glivenko-Cantelli class (in the classical sense) in the standard Borel space $\Omega \setminus A$. But $A$ is a
universal null set in $\Omega$, hence clearly $\mathscr{C}$ is universal Glivenko-Cantelli with respect to non-atomic measures.

The class $\mathscr{C}$ is distribution-free consistently PAC learnable in the domain $\Omega \setminus A$, with the standard sample
complexity $s(\epsilon, \delta, d)$. Let $\mathcal{L}$ be any consistent learning rule for $\mathscr{C}$ in $\Omega$. The restriction of $\mathcal{L}$ to $\Omega \setminus A$ (more
exactly, to $\bigcup_{n=1}^{\infty} ((\Omega \setminus A)^n \times \{0, 1\}^n)$) is a consistent learning rule for $\mathscr{C}$ restricted to the standard Borel
space $\Omega \setminus A$, and together with the fact that $A$ has measure zero with respect to any non-atomic measure, it
implies that $\mathcal{L}$ is a PAC learning rule for $\mathscr{C}$ under non-atomic measures, with the same sample complexity
function $s(\epsilon, \delta, d)$. □

Similarly, we obtain: 

Corollary 8.6. Let $\mathscr{F}$ be a universally separable function class on a Borel domain $\Omega$. If for every $\epsilon > 0$
one has $d = \mathrm{fat}_\epsilon(\mathscr{F} \bmod \omega_1) < \infty$, then $\mathscr{F}$ is a uniform Glivenko-Cantelli class with respect to non-atomic
measures and consistently PAC learnable under non-atomic measures, with a standard sample complexity
corresponding to $d$. □

Here are the two main conclusions of this Section. Notice that the following criteria no longer assume 
universal separability of the classes involved. 

Corollary 8.7. For a concept class $\mathscr{C}$, the following are equivalent.

1. The VC dimension of $\mathscr{C}$ modulo countable sets is $\le d$;

2. For every countable subclass $\mathscr{C}'$ of $\mathscr{C}$, there exists a countable $A \subseteq \Omega$ such that the VC dimension of
$\mathscr{C}'$ restricted to $\Omega \setminus A$ is $\le d$.

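Proof. (1)$\Rightarrow$(2): the VC dimension modulo countable sets is monotone with respect to subclasses, so
$\mathrm{VC}(\mathscr{C}' \bmod \omega_1) \le d$. Now Theorem 8.3, applied to the countable (hence universally separable) class $\mathscr{C}'$, gives the desired conclusion.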

(2)$\Rightarrow$(1): assume uncountable sets $A_1, A_2, \ldots, A_n$ are shattered by $\mathscr{C}$. Select a family $\mathscr{S}$ of $2^n$ concepts
that does the shattering. There is a countable $A$ such that $\mathrm{VC}(\mathscr{S} \upharpoonright (\Omega \setminus A)) \le d$. Choose a representative
$a_i$ in each of the non-empty sets $A_i \setminus A$. Since the set $\{a_i\}_{i=1}^n$ is shattered by the family $\mathscr{S}$ restricted to
$\Omega \setminus A$, one concludes that $n \le d$. □
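
The role of the exceptional set $A$ can be seen in a finite toy analogue. The Python sketch below is not part of the proof: it computes the VC dimension of a small concept class by brute force, before and after restricting to the complement of $A$. A finite set $A$ stands in for the countable set of the statements above, and the names `vc_dimension`, `restrict` and the sample class are hypothetical.

```python
from itertools import combinations

def is_shattered(points, concepts):
    """True if every subset of `points` is the trace of some concept."""
    traces = {frozenset(c & points) for c in concepts}
    return len(traces) == 2 ** len(points)

def vc_dimension(domain, concepts):
    """Largest size of a subset of `domain` shattered by `concepts` (brute force)."""
    domain = sorted(domain)
    best = 0
    for k in range(1, len(domain) + 1):
        if any(is_shattered(frozenset(s), concepts) for s in combinations(domain, k)):
            best = k
        else:
            break  # subsets of shattered sets are shattered, so larger k also fails
    return best

def restrict(concepts, subdomain):
    """Restrict every concept to `subdomain`."""
    return [c & subdomain for c in concepts]

# Toy class: arbitrary behaviour on A = {0, 1, 2}, threshold sets on {3, 4, 5}.
domain = frozenset(range(6))
A = frozenset({0, 1, 2})
subsets_of_A = [frozenset(s) for r in range(4) for s in combinations(sorted(A), r)]
concepts = [s | frozenset(x for x in (3, 4, 5) if x >= t)
            for s in subsets_of_A for t in (3, 4, 5, 6)]

print(vc_dimension(domain, concepts))                        # 4
print(vc_dimension(domain - A, restrict(concepts, domain - A)))  # 1
```

All the extra shattering power of this class is concentrated on $A$; once $A$ is removed, only the threshold structure remains and the dimension drops from 4 to 1.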

Similarly, one obtains: 

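Corollary 8.8. For a function class $\mathscr{F}$ and $\epsilon > 0$, the following are equivalent.

1. $\mathrm{fat}_\epsilon(\mathscr{F} \bmod \omega_1) \le d$;

2. For every countable subclass $\mathscr{F}'$ of $\mathscr{F}$, one has $\mathrm{fat}_\epsilon(\mathscr{F}' \upharpoonright (\Omega \setminus A)) \le d$ for a suitable countable $A$ (which
depends on $\mathscr{F}'$). □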

9. Proofs of two theorems from the Introduction 

Now we are in a position to prove the two main Theorems 1.1 and 1.2 just by putting together various
results established in the article.

9.1. Key to the proof of Theorem 1.1

this is Theorem O 
©$\Leftrightarrow$©: Corollary 8.7.

©$\Rightarrow$©: assume that for every $d$ there is a countable subclass $\mathscr{C}_d$ of $\mathscr{C}$ with the property that the VC
dimension of $\mathscr{C}_d$ is $> d$ after removing any countable subset of $\Omega$. Clearly, the countable class $\bigcup_{d=1}^{\infty} \mathscr{C}_d$ will
have infinite VC dimension outside of every countable subset of $\Omega$, a contradiction.

©$\Rightarrow$©: as a consequence of a classical result of Vapnik and Chervonenkis, every countable subclass
is universal Glivenko-Cantelli with respect to all probability measures supported outside of some countable
subset of $\Omega$, and a standard bound for the sample complexity $s(\delta, \epsilon)$ only depends on $d$, from which the
statement follows.
©$\Rightarrow$©: trivial.

©$\Rightarrow$©: this is Theorem 4.5, and the only implication requiring Martin's Axiom.

In the universally separable case, the implications ©$\Leftrightarrow$(7) are due to Theorem 8.3, ©$\Rightarrow$(8) follows
from Corollary 8.5, (8)$\Rightarrow$(9) is standard, and (9)$\Rightarrow$(1) is trivial. □

9.2. Key to the proof of Theorem 1.2
©$\Leftrightarrow$©: Corollary 8.8.

©$\Rightarrow$©: Assume that for some $\epsilon > 0$ and every value $d \in \mathbb{N}$ there is a countable subclass $\mathscr{F}_d$ of $\mathscr{F}$ with the
property that the $\epsilon$-fat shattering dimension of $\mathscr{F}_d$ is $> d$ after removing any countable subset of $\Omega$. Then
the countable function class $\bigcup_{d=1}^{\infty} \mathscr{F}_d$ will have infinite $\epsilon$-fat shattering dimension outside of every countable
subset of $\Omega$, which is a contradiction.

©=>©: Combining the assumption with Theorem 2.5 in one concludes that every countable subclass 
of $\mathscr{F}$ is universal Glivenko-Cantelli with respect to all probability measures supported outside of a
suitable countable subset of $\Omega$, with a standard bound for the sample complexity $s(\delta, \epsilon)$ only depending on
$d(\epsilon)$.

©$\Rightarrow$©: trivial.

©$\Rightarrow$©: Theorem 4.5. This is the only implication requiring Martin's Axiom.

In the universally separable case, the equivalence of © and (7) is the statement of Theorem 8.4, (7)$\Rightarrow$(8)
is Corollary 8.6, and (8)$\Rightarrow$(9) is standard. □

Note again that the implication ©$\Rightarrow$© is in general invalid, cf. Example 7.3.

10. Conclusion and Open Problems 

We have characterized concept classes $\mathscr{C}$ that are distribution-free PAC learnable under the family of
all non-atomic probability measures on the domain. The criterion is obtained without any measurability 
conditions on the concept class, but at the expense of making a set-theoretic assumption in the form of 
Martin's Axiom. In fact, assuming Martin's Axiom makes things easier, and as this axiom is very natural, 
perhaps it deserves its small corner within the foundations of statistical learning. 

Generalizing the result to function classes, using a version of the fat shattering dimension modulo
countable sets, did not pose particular technical difficulties. However, the finiteness of this combinatorial
parameter is no longer necessary for PAC learnability of a function class under non-atomic measures, just
as in the classical distribution-free situation.

It would still be interesting to know if the present results hold without Martin's Axiom, under the
assumption that the concept class $\mathscr{C}$ is image admissible Souslin (Q, pages 186-187). The difficulty here is
selecting a measurable learning rule $\mathcal{L}$ with the property that the images of all learning samples $(\sigma, C \cap \sigma)$,
$\sigma \in \Omega^n$, are uniform Glivenko-Cantelli. An obvious route to pursue is the recursion on the Borel rank of
$\mathscr{C}$, but we were unable to follow it through.

Now, a concept class $\mathscr{C}$ will be learnable under non-atomic measures provided there is a hypothesis class
$\mathscr{H}$ which has finite VC dimension and such that every $C \in \mathscr{C}$ differs from a suitable $H \in \mathscr{H}$ by a null
set. If $\mathscr{C}$ consists of all finite and all cofinite subsets of $\Omega$, this $\mathscr{H}$ is given by $\{\emptyset, \Omega\}$. One may conjecture
that $\mathscr{C}$ is learnable under non-atomic measures if and only if it admits such a "core" having finite VC
dimension. Is this true?
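
The finite/cofinite example can be simulated informally. The sketch below is only an illustration under stated assumptions: the uniform distribution on $[0, 1)$ plays the role of a non-atomic measure, the learner `learn_core` (a hypothetical majority-vote rule) outputs one of the two core hypotheses $\emptyset$ or $\Omega$, and the particular finite concept is arbitrary.

```python
import random

def learn_core(sample_labels):
    """Pick a hypothesis from the two-element core by majority vote.

    Returns "full" (the whole domain, predict 1 everywhere) if most
    labels are 1, and "empty" (predict 0 everywhere) otherwise.
    """
    ones = sum(sample_labels)
    return "full" if 2 * ones > len(sample_labels) else "empty"

rng = random.Random(0)  # fixed seed so the run is reproducible

# A finite concept and its cofinite complement (hypothetical example).
finite_concept = {0.1, 0.5, 0.9}

def in_finite(x):
    return x in finite_concept

def in_cofinite(x):
    return x not in finite_concept

# Sample from the uniform distribution on [0, 1), a non-atomic measure:
# a finite set is hit with probability zero.
sample = [rng.random() for _ in range(50)]

h_finite = learn_core([int(in_finite(x)) for x in sample])
h_cofinite = learn_core([int(in_cofinite(x)) for x in sample])

print(h_finite)    # empty
print(h_cofinite)  # full
```

In both cases the returned hypothesis differs from the target concept only on a finite set, which is null for every non-atomic measure, so its error is zero.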

Another natural question is: can one characterize concept classes that are uniformly Glivenko-Cantelli
with respect to all non-atomic measures? Apparently, this task requires yet another version of shattering
dimension, which is strictly intermediate between Talagrand's "witness of irregularity" [16] and our VC
dimension modulo countable sets. We do not have a viable candidate.

Is it possible to construct an example of a concept class of finite VC dimension which is not consistently 
PAC learnable @, 0| without additional set-theoretical assumptions, just under the ZFC axiomatics? 

Finally, our investigation opens up a possibility of linking learnability and VC dimension to Boolean
algebras and their Stone spaces. This could be a glib exercise in generalization for its own sake, or maybe
something deeper if one manages to invoke model theory and forcing.